Data mining Project - University of Pisa, acedemic year 2023/24

Authors: Giacomo Aru, Giulia Ghisolfi, Luca Marini, Irene Testa

Task 1 - Incidents Data Understanding and Preparation¶

We import the libraries:

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt, mpld3
from matplotlib.legend_handler import HandlerPathCollection
import seaborn as sns
import matplotlib.dates as mdates
import plotly.express as px
import plotly.offline as pyo
import plotly.subplots as sp
from plotly_calplot import calplot
import plotly.graph_objs as go
import math
import os
import calendar
import sys
from sklearn.neighbors import KNeighborsClassifier
from geopy import distance as geopy_distance
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
from pyproj import Transformer
import zipfile
import builtins
sys.path.append(os.path.abspath('..'))
from plot_utils import *

We define constants and settings for the notebook:

In [ ]:
%matplotlib inline

DATA_FOLDER_PATH = '../data/'

pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', None)

We load the datasets:

In [ ]:
incidents_path = DATA_FOLDER_PATH + 'incidents.csv'
elections_path = DATA_FOLDER_PATH + 'year_state_district_house_cleaned.csv'
poverty_path = DATA_FOLDER_PATH + 'poverty_by_state_year_cleaned.csv'
incidents_df = pd.read_csv(incidents_path, low_memory=False)
elections_df = pd.read_csv(elections_path, low_memory=False)
poverty_df = pd.read_csv(poverty_path, low_memory=False)

We assess the correct loading of the dataset printing 2 random rows:

In [ ]:
incidents_df.sample(n=2, random_state=1)
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2
8141 2015-12-10 Illinois Chicago 2700 block of South Lawndale 41.8423 -87.7174 4.0 24.0 12.0 20.0 Adult 18+ Male 20.0 20.0 20.0 0.0 0.0 1.0 1.0 0.0 0 1 0.0 0.0 1.0 Little Village - documented gang member shot multiple times, critical Shot - Wounded/Injured Gang involvement
64095 2017-11-05 Hawaii Honolulu NaN 21.3061 -157.8610 1.0 26.0 13.0 35.0 Adult 18+ Male 35.0 35.0 35.0 0.0 0.0 1.0 1.0 0.0 0 1 0.0 0.0 1.0 dropped off at hospital, unclear where he was shot Shot - Wounded/Injured NaN

This dataset contains information about gun incidents in the USA.

In the following table we provide the characteristics of each attribute of the dataset. To define the type of the attributes we used the categorization described by Pang-Ning Tan, Michael Steinbach and Vipin Kumar in the book Introduction to Data Mining. For each attribute, we also reported the desidered pandas dtype for later analysis.

# Name Type Description Desired dtype
0 date Numeric (Interval) Date of incident occurrence datetime
1 state Categorical (Nominal) Dtate where incident took place object
2 city_or_county Categorical (Nominal) City or county where incident took place object
3 address Categorical (Nominal) Address where incident took place object
4 latitude Numeric (Interval) Latitude of the incident float64
5 longitude Numeric (Interval) Longitude of the incident float64
6 congressional_district Categorical (Nominal) Congressional district where the incident took place int64
7 state_house_district Categorical (Nominal) State house district int64
8 state_senate_district Categorical (Nominal) State senate district where the incident took place int64
9 participant_age1 Numeric (Ratio) Exact age of one (randomly chosen) participant in the incident int64
10 participant_age_group1 Categorical (Ordinal) Exact age group of one (randomly chosen) participant in the incident object
11 participant_gender1 Categorical (Nominal) Exact gender of one (randomly chosen) participant in the incident object
12 min_age_participants Numeric (Ratio) Minimum age of the participants in the incident int64
13 avg_age_participants Numeric (Ratio) Average age of the participants in the incident float64
14 max_age_participants Numeric (Ratio) Maximum age of the participants in the incident int64
15 n_participants_child Numeric (Ratio) Number of child participants 0-11 int64
16 n_participants_teen Numeric (Ratio) Number of teen participants 12-17 int64
17 n_participants_adult Numeric (Ratio) Number of adult participants (18 +) int64
18 n_males Numeric (Ratio) Number of males participants int64
19 n_females Numeric (Ratio) Number of females participants int64
20 n_killed Numeric (Ratio) Number of people killed int64
21 n_injured Numeric (Ratio) Number of people injured int64
22 n_arrested Numeric (Ratio) Number of arrested participants int64
23 n_unharmed Numeric (Ratio) Number of unharmed participants int64
24 n_participants Numeric (Ratio) Number of participants in the incident int64
25 notes Categorical (Nominal) Additional notes about the incident object
26 incident_characteristics1 Categorical (Nominal) Incident characteristics object
27 incident_characteristics2 Categorical (Nominal) Incident characteristics (not all incidents have two available characteristics) object

We display a concise summary of the DataFrame:

In [ ]:
incidents_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239677 entries, 0 to 239676
Data columns (total 28 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   date                       239677 non-null  object 
 1   state                      239677 non-null  object 
 2   city_or_county             239677 non-null  object 
 3   address                    223180 non-null  object 
 4   latitude                   231754 non-null  float64
 5   longitude                  231754 non-null  float64
 6   congressional_district     227733 non-null  float64
 7   state_house_district       200905 non-null  float64
 8   state_senate_district      207342 non-null  float64
 9   participant_age1           147379 non-null  float64
 10  participant_age_group1     197558 non-null  object 
 11  participant_gender1        203315 non-null  object 
 12  min_age_participants       164879 non-null  object 
 13  avg_age_participants       165057 non-null  object 
 14  max_age_participants       164969 non-null  object 
 15  n_participants_child       197573 non-null  object 
 16  n_participants_teen        197578 non-null  object 
 17  n_participants_adult       197575 non-null  object 
 18  n_males                    203315 non-null  float64
 19  n_females                  203315 non-null  float64
 20  n_killed                   239677 non-null  int64  
 21  n_injured                  239677 non-null  int64  
 22  n_arrested                 212051 non-null  float64
 23  n_unharmed                 212051 non-null  float64
 24  n_participants             239677 non-null  float64
 25  notes                      158660 non-null  object 
 26  incident_characteristics1  239351 non-null  object 
 27  incident_characteristics2  141931 non-null  object 
dtypes: float64(11), int64(2), object(15)
memory usage: 51.2+ MB

We notice that:

  • congressional_district, state_house_district, state_senate_district, participant_age1, n_males, n_females, n_arrested, n_unharmed, n_participants are stored as float64 while should be int64
  • min_age_participants, avg_age_participants, max_age_participants, n_participants_child, n_participants_teen, n_participants_adult are stored as object while should be int64, this probably indicates the presence of syntactic errors (not in the domain)
  • the presence of missing values within many attributes; the only attributes without missing values are the following: date, state, city_or_county, n_killed, n_injured, n_participants

We display descriptive statistics of the DataFrame so to better understand how to cast the data:

In [ ]:
incidents_df.describe(include='all')
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2
count 239677 239677 239677 223180 231754.000000 231754.000000 227733.000000 200905.000000 207342.00000 147379.000000 197558 203315 164879 165057 164969 197573 197578 197575 203315.000000 203315.000000 239677.000000 239677.000000 212051.000000 212051.000000 239677.000000 158660 239351 141931
unique 2437 51 12898 198037 NaN NaN NaN NaN NaN NaN 3 3 12673 12869 12852 25 33 50 NaN NaN NaN NaN NaN NaN NaN 136652 52 90
top 2017-01-01 Illinois Chicago 2375 International Pkwy NaN NaN NaN NaN NaN NaN Adult 18+ Male 19.0 22.0 24.0 0.0 0.0 1.0 NaN NaN NaN NaN NaN NaN NaN man shot Shot - Wounded/Injured Officer Involved Incident
freq 342 17556 10814 160 NaN NaN NaN NaN NaN NaN 181324 177945 7981 7606 6237 193852 178223 106461 NaN NaN NaN NaN NaN NaN NaN 501 93926 13881
mean NaN NaN NaN NaN 37.546598 -89.338348 8.001265 55.447132 20.47711 30.295707 NaN NaN NaN NaN NaN NaN NaN NaN 1.520252 0.212340 0.252290 0.494007 0.468439 0.494169 1.636895 NaN NaN NaN
std NaN NaN NaN NaN 5.130763 14.359546 8.480835 42.048117 14.20456 13.363592 NaN NaN NaN NaN NaN NaN NaN NaN 0.996767 0.490888 0.521779 0.729952 0.851035 0.925566 1.252514 NaN NaN NaN
min NaN NaN NaN NaN 19.111400 -171.429000 0.000000 1.000000 1.00000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN
25% NaN NaN NaN NaN 33.903400 -94.158725 2.000000 21.000000 9.00000 21.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN NaN NaN
50% NaN NaN NaN NaN 38.570600 -86.249600 5.000000 47.000000 19.00000 27.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN NaN NaN
75% NaN NaN NaN NaN 41.437375 -80.048625 10.000000 84.000000 30.00000 37.000000 NaN NaN NaN NaN NaN NaN NaN NaN 2.000000 0.000000 0.000000 1.000000 1.000000 1.000000 2.000000 NaN NaN NaN
max NaN NaN NaN NaN 71.336800 97.433100 53.000000 901.000000 94.00000 311.000000 NaN NaN NaN NaN NaN NaN NaN NaN 61.000000 23.000000 50.000000 53.000000 63.000000 20.000000 103.000000 NaN NaN NaN

We cast the attributes to the correct type:

In [ ]:
# NUMERIC ATTRIBUTES

numerical_features = [
    'participant_age1',
    'n_males',
    'n_females',
    'n_killed',
    'n_injured',
    'n_arrested',
    'n_unharmed',
    'n_participants',
    'min_age_participants',
    'avg_age_participants',
    'max_age_participants',
    'n_participants_child',
    'n_participants_teen',
    'n_participants_adult',
    # (the following attributes should be categorical, but for convenience we keep them numeric)
    'congressional_district',
    'state_house_district',
    'state_senate_district'
    ]
incidents_df[numerical_features] = incidents_df[numerical_features].apply(pd.to_numeric, errors='coerce')

# DATE
incidents_df['date'] = pd.to_datetime(incidents_df['date'], format='%Y-%m-%d')

# CATEGORICAL ATTRIBUTES
# nominal
incidents_df['participant_gender1'] = incidents_df['participant_gender1'].astype("category")
# ordinal
incidents_df['participant_age_group1'] = incidents_df['participant_age_group1'].astype(
    pd.api.types.CategoricalDtype(categories = ["Child 0-11", "Teen 12-17", "Adult 18+"], ordered = True))

We display again information about the dataset to check the correctness of the casting and the number of missing values:

In [ ]:
incidents_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239677 entries, 0 to 239676
Data columns (total 28 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   date                       239677 non-null  datetime64[ns]
 1   state                      239677 non-null  object        
 2   city_or_county             239677 non-null  object        
 3   address                    223180 non-null  object        
 4   latitude                   231754 non-null  float64       
 5   longitude                  231754 non-null  float64       
 6   congressional_district     227733 non-null  float64       
 7   state_house_district       200905 non-null  float64       
 8   state_senate_district      207342 non-null  float64       
 9   participant_age1           147379 non-null  float64       
 10  participant_age_group1     197558 non-null  category      
 11  participant_gender1        203315 non-null  category      
 12  min_age_participants       159126 non-null  float64       
 13  avg_age_participants       159168 non-null  float64       
 14  max_age_participants       159084 non-null  float64       
 15  n_participants_child       197568 non-null  float64       
 16  n_participants_teen        197571 non-null  float64       
 17  n_participants_adult       197572 non-null  float64       
 18  n_males                    203315 non-null  float64       
 19  n_females                  203315 non-null  float64       
 20  n_killed                   239677 non-null  int64         
 21  n_injured                  239677 non-null  int64         
 22  n_arrested                 212051 non-null  float64       
 23  n_unharmed                 212051 non-null  float64       
 24  n_participants             239677 non-null  float64       
 25  notes                      158660 non-null  object        
 26  incident_characteristics1  239351 non-null  object        
 27  incident_characteristics2  141931 non-null  object        
dtypes: category(2), datetime64[ns](1), float64(17), int64(2), object(6)
memory usage: 48.0+ MB

We drop duplicates:

In [ ]:
print(f"# of rows before dropping duplicates: {incidents_df.shape[0]}")
incidents_df.drop_duplicates(inplace=True)
print(f"# of rows after dropping duplicates: {incidents_df.shape[0]}")
# of rows before dropping duplicates: 239677
# of rows after dropping duplicates: 239381

Now we visualize missing values:

In [ ]:
fig, ax = plt.subplots(figsize=(12,8)) 
sns.heatmap(incidents_df.isnull(), cbar=False, xticklabels=True, ax=ax)
ax.set_title('Null values Heatmap')
ax.set_ylabel('row index')
Out[ ]:
Text(120.72222222222221, 0.5, 'row index')

We observe that:

  • The following attributes are missing together:
    • latitude and longitude
    • n_participants_child, n_participants_teen, n_participants_adult
    • n_males, n_females
    • n_arrested, n_unharmed
  • There are many missing values for the following attributes:
    • participant_age1
    • min_age_participants, avg_age_participants, max_age_participants (often missing together but not always)
    • notes
    • incident_characteristics2
  • Often participant_age1 is missing but participant_age_group1 is not and the same holds for state_house_district w.r.t state_senate_district.
  • latitude and longitude are often available and could be used to recover the missing values of address, congressional_district, state_house_district and state_senate_district (using external libraries that offer this service).

We display descriptive statistics:

In [ ]:
incidents_df.describe(include='all')
/var/folders/8n/f3j9y6bd3zd_kp83y0vvn_b80000gn/T/ipykernel_5016/1539327463.py:1: FutureWarning:

Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.

Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2
count 239381 239381 239381 222934 231458.000000 231458.000000 227438.000000 200648.000000 207075.000000 147360.000000 197508 203266 1.591070e+05 1.591490e+05 1.590650e+05 197518.000000 197521.000000 197522.000000 203266.000000 203266.000000 239381.000000 239381.000000 211988.000000 211988.000000 239381.000000 158558 239055 141775
unique 2437 51 12898 198037 NaN NaN NaN NaN NaN NaN 3 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 136652 52 90
top 2017-01-01 00:00:00 Illinois Chicago 2375 International Pkwy NaN NaN NaN NaN NaN NaN Adult 18+ Male NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN man shot Shot - Wounded/Injured Officer Involved Incident
freq 342 17555 10813 148 NaN NaN NaN NaN NaN NaN 181274 177899 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 501 93910 13879
first 2013-01-01 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
last 2030-11-28 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN NaN NaN 37.548820 -89.335650 7.997366 55.439910 20.472819 30.295392 NaN NaN 5.675037e+06 2.446473e+04 1.871629e+04 16.126085 8.464052 18.520398 1.520338 0.212377 0.252589 0.494538 0.468456 0.494179 1.638589 NaN NaN NaN
std NaN NaN NaN NaN 5.130140 14.359154 8.477303 42.050721 14.206066 13.363816 NaN NaN 2.256305e+09 2.199620e+06 1.071589e+05 3295.134752 2224.207250 3233.125322 0.996845 0.490928 0.522020 0.730173 0.851113 0.925628 1.252204 NaN NaN NaN
min NaN NaN NaN NaN 19.111400 -171.429000 0.000000 1.000000 1.000000 0.000000 NaN NaN -1.000000e+03 -1.000000e+03 -1.000000e+03 -977.000000 -947.000000 -991.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN
25% NaN NaN NaN NaN 33.905725 -94.144100 2.000000 21.000000 9.000000 21.000000 NaN NaN 1.900000e+01 2.100000e+01 2.100000e+01 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN NaN NaN
50% NaN NaN NaN NaN 38.572950 -86.248800 5.000000 47.000000 19.000000 27.000000 NaN NaN 2.500000e+01 2.700000e+01 2.800000e+01 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 NaN NaN NaN
75% NaN NaN NaN NaN 41.440950 -80.048000 10.000000 84.000000 30.000000 37.000000 NaN NaN 3.500000e+01 3.700000e+01 4.000000e+01 0.000000 0.000000 2.000000 2.000000 0.000000 0.000000 1.000000 1.000000 1.000000 2.000000 NaN NaN NaN
max NaN NaN NaN NaN 71.336800 97.433100 53.000000 901.000000 94.000000 311.000000 NaN NaN 9.000000e+11 8.719163e+08 8.300000e+06 886365.000000 762487.000000 827900.000000 61.000000 23.000000 50.000000 53.000000 63.000000 20.000000 103.000000 NaN NaN NaN

We can already make some considerations about the dataset:

  • incidents happened in 51 different states (we probably have at least one incident for each state)
  • the most frequent value for the attrbute state is Illinois and the most frequent value for city_or_county is Chicago (which is in Illinois, it is consistent)
  • 148 incidents happened at the address "2375 International Pkwy" (an airport in Dallas, Texsas)
  • the majority of incidents involved males
  • there are 52 unique values for the attribute incidents_characteristics1 and the most frequent is "Shot - Wounded/Injured" (at the time of data collection, it is likely that the values this attribute could take on were limited to a predefined set)
  • there are 90 unique values for the attribute incidents_characteristicsch2 and the most frequent is "Officer Involved Incident"; this attribute presents more missing values than incidents_characteristics1 (at the time of data collection, it is likely that the values this attribute could take on were limited to a predefined set)
  • the most frequent value for the attribute notes is "man shot", but the number of unique values this attribute assumes is very high (at the time of data collection the values this attribute could take on were not defined)
  • there are many inconsistencies and/or erros, for example:
    • the maximum value for the attribute date is 2030-11-28
    • the range of the attributes age, min_age_participants, avg_age_participants, max_age_participants, n_participants_child, n_participants_teen, n_participants_adult is outside the domain of the attributes (e.g. the maximum value for the attribute age is 311)

In the following sections of this notebook we will analyze each attribute in detail.

To avoid re-running some cells, we save checkpoints of the dataframe at different stages of the analysis and load the dataframe from the last checkpoint using the following functions:

In [ ]:
LOAD_DATA_FROM_CHECKPOINT = True
CHECKPOINT_FOLDER_PATH = 'checkpoints/'

def save_checkpoint(df, checkpoint_name):
    df.to_csv(CHECKPOINT_FOLDER_PATH + checkpoint_name + '.csv')

def load_checkpoint(checkpoint_name, casting=None, date_cols=None):
    df = pd.read_csv(
        CHECKPOINT_FOLDER_PATH + checkpoint_name + '.csv',
        low_memory=False,
        index_col=0,
        parse_dates=date_cols,
        dtype=casting
        )
    return df

Date attribute: exploration and preparation¶

We plot the distribution of the dates using different binning strategies:

In [ ]:
def plot_dates(df_column, title=None, color=None):
    def iqr_fence(x):
        q1 = x.quantile(0.25)
        med = x.quantile(0.5)
        q3 = x.quantile(0.75)
        IQR = q3 - q1
        u = x.max()
        l = x.min()
        Lower_Fence = builtins.max(builtins.min(x[x > q1 - (1.5 * IQR)], default=pd.Timestamp.min), l)
        Upper_Fence = builtins.min(builtins.max(x[x < q3 + (1.5 * IQR)], default=pd.Timestamp.max), u)
        return [Lower_Fence, q1, med, q3, Upper_Fence]
    relevant_positions = iqr_fence(df_column)
    n_items = len(df_column.index)
    min = df_column.min()
    max = df_column.max()

    fig, axs = plt.subplots(3, sharex=True, figsize=(14, 6))
    fig.suptitle(title)

    # one bin per month
    n_bin = int((max - min).days / 30)
    axs[0].hist(df_column, bins=n_bin, density=True, color=color)
    axs[0].set_ylabel("One bin per month")
    axs[0].grid(axis='y')

    # number of bins computed using Sturge's rule
    n_bin = int(1 + math.log2(n_items))
    axs[1].hist(df_column, bins=n_bin, density=True, color=color)
    axs[1].set_ylabel("Sturge\'s rule binning")
    axs[1].grid(axis='y')

    axs[2].boxplot(x=mdates.date2num(df_column), labels=[''], vert=False)
    axs[2].set_xlabel('date')

    for i in range(2):
        axs[i].axvline(x = relevant_positions[0], color = 'black', linestyle = '--', alpha=0.75)
        axs[i].axvline(x = relevant_positions[1], color = 'black', linestyle = '-.', alpha=0.75)
        axs[i].axvline(x = relevant_positions[2], color = 'black', linestyle = '-.', alpha=0.75)
        axs[i].axvline(x = relevant_positions[3], color = 'black', linestyle = '-.', alpha=0.75)
        axs[i].axvline(x = relevant_positions[4], color = 'black', linestyle = '--', alpha=0.75)



plot_dates(incidents_df['date'], title='Dates distribution')
print('Range data: ', incidents_df['date'].min(), ' - ', incidents_df['date'].max())
print('Unique years: ', sorted(incidents_df['date'].dt.year.unique()))
num_oor = incidents_df[incidents_df['date'].dt.year>2018].shape[0]
print(f'Number of rows with out of range value for the attribute date: {num_oor} ({num_oor/incidents_df.shape[0]*100:.2f}%)')
Range data:  2013-01-01 00:00:00  -  2030-11-28 00:00:00
Unique years:  [2013, 2014, 2015, 2016, 2017, 2018, 2028, 2029, 2030]
Number of rows with out of range value for the attribute date: 23008 (9.61%)

These plots show that the number of incidents with an out of range value for the attribute date is non negligible (9.6%) and, excluding these points, there are no incidents happened after the year 2018. Instead of discarding rows with out-of-range dates, we will try to correct the errors to prevent excessive data loss. Since there are no other features that could suggest the timeframe of the incident, we can only proceed using one of the following approaches:

  • check if those records have duplicates with a correct date
  • suppose dates were entered manually using a numeric keypad and that the errors are typos (e.g. 2030 is actually 2020)
  • replace the errors with the mean or median value

Let's check if there are duplicates with a correct date:

In [ ]:
incidents_future = incidents_df[incidents_df['date'].dt.year>2018].drop(columns=['date'])
incidents_past = incidents_df[incidents_df['date'].dt.year<2019].drop(columns=['date'])
incidents_past[incidents_past.isin(incidents_future).any(axis=1)].size!=0
Out[ ]:
False

Since there are no duplicates, we proceed with the second and third approach:

In [ ]:
incidents_df['year'] = incidents_df['date'].dt.year
mean_date = incidents_df[incidents_df['year']<2019]['date'].mean()
median_date = incidents_df[incidents_df['year']<2019]['date'].median()

incidents_df['date_minus10'] = incidents_df['date']
incidents_df['date_minus10'] = incidents_df['date'].apply(lambda x : x - pd.DateOffset(years=10) if x.year>2018 else x)
incidents_df['date_minus11'] = incidents_df['date']
incidents_df['date_minus11'] = incidents_df['date'].apply(lambda x : x - pd.DateOffset(years=11) if x.year>2018 else x)
incidents_df['date_mean'] = incidents_df['date']
incidents_df['date_mean'] = incidents_df['date'].apply(lambda x : mean_date if x.year>2018 else x)
incidents_df['date_mean'] = pd.to_datetime(incidents_df['date_mean'], format='%Y-%m-%d') # discard hours, minutes and seconds
incidents_df['date_median'] = incidents_df['date']
incidents_df['date_median'] = incidents_df['date'].apply(lambda x : median_date if x.year>2018 else x)

plot_dates(incidents_df['date_minus10'], 'Dates distribution (year - 10 for oor)')
plot_dates(incidents_df['date_minus11'], 'Dates distribution (year - 11 for oor)', color='orange')
plot_dates(incidents_df['date_mean'], 'Dates distribution (oor replaced with mean)', color='green')
plot_dates(incidents_df['date_median'], 'Dates distribution (oor replaced with median)', color='red')

Unfortunately, these methods lead to unsatisfactory results, as they significantly alter the distribution. Therefore, we decided to split the date attribute into year, month and day, and set to nan the date attribute if larger than 2018. We also recover the day of the week in which the incident occurred.

In [ ]:
incidents_df.drop(columns=['date_minus10', 'date_minus11', 'date_mean', 'date_median'], inplace=True)
incidents_df['date_original'] = incidents_df['date']
incidents_df['date'] = incidents_df['date'].apply(lambda x : pd.NaT if x.year>2018 else x)
incidents_df['year'] = incidents_df['date'].dt.year.astype('UInt64')
incidents_df['month'] = incidents_df['date'].dt.month.astype('UInt64')
incidents_df['month_name'] = incidents_df['date'].dt.month_name()
incidents_df['day'] = incidents_df['date'].dt.day.astype('UInt64')
incidents_df['day_of_week'] = incidents_df['date'].dt.dayofweek.astype('UInt64')
incidents_df['day_of_week_name'] = incidents_df['date'].dt.day_name()

We visualize the number of incidents per month:

In [ ]:
incidents_df.groupby('month').size().plot(
    kind='bar',
    figsize=(10, 5),
    title='Number of incidents per month',
    xlabel='Month',
    ylabel='Number of incidents'
)
plt.xticks(range(12), calendar.month_name[1:13], rotation=45);
plt.savefig("../html/incidents_per_month.svg")
In [ ]:
# display the months with a number of incidents below the 25th percentile
months = incidents_df.groupby('month').size()
months[months<months.quantile(0.25)].index
Out[ ]:
Index([4, 6, 11], dtype='UInt64', name='month')
In [ ]:
# display the months with a number of incidents above the 75th percentile
months[months>months.quantile(0.75)].index
Out[ ]:
Index([1, 3, 8], dtype='UInt64', name='month')

We visualize the number of incidents per day of the week:

In [ ]:
fig = incidents_df.groupby('day_of_week').size().plot(
    kind='bar',
    figsize=(10, 5),
    title='Number of incidents per day of the week',
    xlabel='Day of the week',
    ylabel='Number of incidents'
)
plt.xticks(range(7), calendar.day_name[0:7], rotation=45);
plt.savefig("../html/incidents_per_week_day.svg")
In [ ]:
# display the days of the week with a number of incidents below the 25th percentile
week_days = incidents_df.groupby('day_of_week').size()
week_days[week_days<week_days.quantile(0.25)].index
Out[ ]:
Index([3, 4], dtype='UInt64', name='day_of_week')
In [ ]:
# display the days of the week with a number of incidents above the 75th percentile
week_days[week_days>week_days.quantile(0.75)].index
Out[ ]:
Index([5, 6], dtype='UInt64', name='day_of_week')

We display the number of incidents per day over the years:

In [ ]:
def group_by_day(df, date_col):
    counts_by_day = df[date_col].groupby([df[date_col].dt.year, df[date_col].dt.month, df[date_col].dt.day]).size().rename_axis(['year', 'month', 'day']).to_frame('Number of incidents').reset_index()
    counts_by_day[['year', 'month', 'day']] = counts_by_day[['year', 'month', 'day']].astype('int64')
    # add missing days
    for day in pd.date_range(start='2017-01-01', end='2017-12-31'): # 2017%4!=0, this will exclude 29 February
        for year in counts_by_day['year'].unique():
            row_exists = (
                (counts_by_day['year']==year) &
                (counts_by_day['month']==day.month) &
                (counts_by_day['day']==day.day)
                ).any()
            if not row_exists:
                counts_by_day = pd.concat([
                        counts_by_day,
                        pd.DataFrame({'year': [year], 'month': [day.month], 'day': [day.day], 'Number of incidents': [0]})
                    ])
    counts_by_day.sort_values(by=['year', 'month', 'day'], inplace=True)
    counts_by_day['Day'] = counts_by_day.apply(lambda x: f'{x["day"]} {calendar.month_name[x["month"]]}', axis=1)
    return counts_by_day
In [ ]:
incidents_counts_by_day = group_by_day(
    incidents_df[~((incidents_df['date'].dt.day==29) & (incidents_df['date'].dt.month==2))], # exclude 29 february
    'date_original'
)
fig = px.line(
    incidents_counts_by_day,
    x='Day',
    y='Number of incidents',
    title='Number of incidents per day',
    labels={'Day': 'Day of the year', 'Number of incidents': 'Number of incidents'},
    facet_col='year',
    width=1200,
    height=800,
    facet_col_wrap=3
)
fig.update_xaxes(tickangle=-90)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_day_line.html', auto_open=False);

We display the number of incidents per day over the years as heatmap, using a calendar plot:

In [ ]:
incidenst_per_day = incidents_df.groupby('date').size()

fig = calplot(
    incidenst_per_day.reset_index(),
    x="date",
    y=0,
    title="Number of incidents per day"
)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_day_heatmap.html', auto_open=False);

We visualize the frequency of incidents occurring across different festivities. We consider the following holidays (most of them are federal holidays)

Holiday Date
New Year’s Day January 1
Martin Luther King’s Birthday 3rd Monday in January
Washington’s Birthday 3rd Monday in February
Memorial Day last Monday in May
Juneteenth National Independence Day June 19
Independence Day July 4
Labor Day 1st Monday in September
Columbus Day 2nd Monday in October
Veterans’ Day November 11
Thanksgiving Day 4th Thursday in November
Christmas Day December 25
Easter Sunday (based on moon phase)
Easter Monday Day after Easter
Black Friday Day after Thanksgiving
In [ ]:
holiday_dict = {'New Year\'s Day': ['2013-01-01', '2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01'],
    'Martin Luther King\'s Day': ['2013-01-21', '2014-01-20', '2015-01-19', '2016-01-18', '2017-01-16', '2018-01-15'],
    'Washington\'s Birthday': ['2013-02-18', '2014-02-17', '2015-02-16', '2016-02-15', '2017-02-20', '2018-02-19'],
    'Saint Patrick\'s Day': ['2013-03-17', '2014-03-17', '2015-03-17', '2016-03-17', '2017-03-17', '2018-03-17'],
    'Easter': ['2013-03-31', '2014-04-20', '2015-04-05', '2016-03-27', '2017-04-16', '2018-04-01'], 
    'Easter Monday': ['2013-04-01', '2014-04-21', '2015-04-06', '2016-03-28', '2017-04-17', '2018-04-02'],
    'Memorial Day': ['2013-05-27', '2014-05-26', '2015-05-25', '2016-05-30', '2017-05-29', '2018-05-28'],
    'Juneteenth National Independence Day': ['2013-06-19', '2014-06-19', '2015-06-19', '2016-06-19', '2017-06-19', '2018-06-19'],
    'Independence Day': ['2013-07-04', '2014-07-04', '2015-07-03', '2016-07-04', '2017-07-04', '2018-07-04'],
    'Labor Day': ['2013-09-02', '2014-09-01', '2015-09-07', '2016-09-05', '2017-09-04', '2018-09-03'],
    'Columbus Day': ['2013-10-14', '2014-10-13', '2015-10-12', '2016-10-10', '2017-10-09', '2018-10-08'],
    'Veterans\' Day': ['2013-11-11', '2014-11-11', '2015-11-11', '2016-11-11', '2017-11-11', '2018-11-11'],
    'Thanksgiving Day': ['2013-11-28', '2014-11-27', '2015-11-26', '2016-11-24', '2017-11-23', '2018-11-22'],
    'Black Friday': ['2013-11-29', '2014-11-28', '2015-11-27', '2016-11-25', '2017-11-24', '2018-11-23'],
    'Christmas Day': ['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-26', '2017-12-25', '2018-12-25']}

We display the number of incidents occurring on each holiday for each year:

In [ ]:
dfs = []
for holiday in holiday_dict.keys():
    holiday_data = {
        'holiday': holiday,
        'n_incidents_2013': incidents_df[incidents_df['date'].isin([holiday_dict[holiday][0]])].shape[0],
        'n_incidents_2014': incidents_df[incidents_df['date'].isin([holiday_dict[holiday][1]])].shape[0],
        'n_incidents_2015': incidents_df[incidents_df['date'].isin([holiday_dict[holiday][2]])].shape[0],
        'n_incidents_2016': incidents_df[incidents_df['date'].isin([holiday_dict[holiday][3]])].shape[0],
        'n_incidents_2017': incidents_df[incidents_df['date'].isin([holiday_dict[holiday][4]])].shape[0],
        'n_incidents_2018': incidents_df[incidents_df['date'].isin([holiday_dict[holiday][5]])].shape[0],
        'n_incidents_total': incidents_df[incidents_df['date'].isin(holiday_dict[holiday])].shape[0]
    }

    df = pd.DataFrame([holiday_data])
    dfs.append(df)
dfs.append(pd.DataFrame([{
    'holiday': 'Total incidents during the year',
    'n_incidents_2013': incidents_df[incidents_df['date'].dt.year==2013].shape[0],
    'n_incidents_2014': incidents_df[incidents_df['date'].dt.year==2014].shape[0],
    'n_incidents_2015': incidents_df[incidents_df['date'].dt.year==2015].shape[0],
    'n_incidents_2016': incidents_df[incidents_df['date'].dt.year==2016].shape[0],
    'n_incidents_2017': incidents_df[incidents_df['date'].dt.year==2017].shape[0],
    'n_incidents_2018': incidents_df[incidents_df['date'].dt.year==2018].shape[0],
    'n_incidents_total': incidents_df.shape[0]}]))
holidays_df = pd.concat(dfs, ignore_index=True)
holidays_df.sort_values(by=['n_incidents_2017'], ascending=False, inplace=True)
holidays_df
Out[ ]:
holiday n_incidents_2013 n_incidents_2014 n_incidents_2015 n_incidents_2016 n_incidents_2017 n_incidents_2018 n_incidents_total
15 Total incidents during the year 253 37592 44614 58724 61389 13801 239381
0 New Year's Day 3 171 185 221 342 242 1164
8 Independence Day 2 135 106 224 248 0 715
4 Easter 1 97 122 172 229 0 621
6 Memorial Day 0 116 150 161 217 0 644
1 Martin Luther King's Day 1 124 90 118 186 155 674
7 Juneteenth National Independence Day 0 89 111 185 185 0 570
5 Easter Monday 0 97 122 164 180 0 563
2 Washington's Birthday 0 64 85 132 179 165 625
9 Labor Day 0 145 144 207 172 0 668
13 Black Friday 2 77 99 157 170 0 505
11 Veterans' Day 1 112 113 158 164 0 548
10 Columbus Day 0 127 114 170 160 0 571
3 Saint Patrick's Day 2 91 116 144 150 178 681
12 Thanksgiving Day 1 75 93 149 133 0 451
14 Christmas Day 3 68 131 174 119 0 495

We display the number of incidents occurring on each holiday for each year as a percentage:

In [ ]:
holidays_df_percents = holidays_df.copy()

holidays_df_percents['n_incidents_2013'] = holidays_df_percents['n_incidents_2013'] / holidays_df_percents[
    holidays_df_percents['holiday']=='Total incidents during the year']['n_incidents_2013'].values[0] * 100
holidays_df_percents['n_incidents_2014'] = holidays_df_percents['n_incidents_2014'] / holidays_df_percents[
    holidays_df_percents['holiday']=='Total incidents during the year']['n_incidents_2014'].values[0] * 100
holidays_df_percents['n_incidents_2015'] = holidays_df_percents['n_incidents_2015'] / holidays_df_percents[
    holidays_df_percents['holiday']=='Total incidents during the year']['n_incidents_2015'].values[0] * 100
holidays_df_percents['n_incidents_2016'] = holidays_df_percents['n_incidents_2016'] / holidays_df_percents[
    holidays_df_percents['holiday']=='Total incidents during the year']['n_incidents_2016'].values[0] * 100
holidays_df_percents['n_incidents_2017'] = holidays_df_percents['n_incidents_2017'] / holidays_df_percents[
    holidays_df_percents['holiday']=='Total incidents during the year']['n_incidents_2017'].values[0] * 100
holidays_df_percents['n_incidents_2018'] = holidays_df_percents['n_incidents_2018'] / holidays_df_percents[
    holidays_df_percents['holiday']=='Total incidents during the year']['n_incidents_2018'].values[0] * 100
holidays_df_percents.drop(holidays_df.index[0], inplace=True)

holidays_df_percents.sort_values(by=['n_incidents_2017'], ascending=False, inplace=True)
holidays_df_percents
Out[ ]:
holiday n_incidents_2013 n_incidents_2014 n_incidents_2015 n_incidents_2016 n_incidents_2017 n_incidents_2018 n_incidents_total
0 New Year's Day 1.185771 0.454884 0.414668 0.376337 0.557103 1.753496 1164
8 Independence Day 0.790514 0.359119 0.237594 0.381445 0.403981 0.000000 715
4 Easter 0.395257 0.258034 0.273457 0.292896 0.373031 0.000000 621
6 Memorial Day 0.000000 0.308576 0.336217 0.274164 0.353484 0.000000 644
1 Martin Luther King's Day 0.395257 0.329857 0.201730 0.200940 0.302986 1.123107 674
7 Juneteenth National Independence Day 0.000000 0.236753 0.248801 0.315033 0.301357 0.000000 570
5 Easter Monday 0.000000 0.258034 0.273457 0.279273 0.293212 0.000000 563
2 Washington's Birthday 0.000000 0.170249 0.190523 0.224780 0.291583 1.195566 625
9 Labor Day 0.000000 0.385720 0.322769 0.352496 0.280180 0.000000 668
13 Black Friday 0.790514 0.204831 0.221903 0.267352 0.276923 0.000000 505
11 Veterans' Day 0.395257 0.297936 0.253284 0.269055 0.267149 0.000000 548
10 Columbus Day 0.000000 0.337838 0.255525 0.289490 0.260633 0.000000 571
3 Saint Patrick's Day 0.790514 0.242073 0.260008 0.245215 0.244343 1.289762 681
12 Thanksgiving Day 0.395257 0.199511 0.208455 0.253729 0.216651 0.000000 451
14 Christmas Day 1.185771 0.180890 0.293630 0.296301 0.193846 0.000000 495

We visualize the data using a bar plot:

In [ ]:
fig = px.bar(
    holidays_df.drop(holidays_df.index[0]),
    x='holiday',
    y=['n_incidents_2013', 'n_incidents_2014', 'n_incidents_2015', 'n_incidents_2016', 'n_incidents_2017', 'n_incidents_2018'],
    title='Number of incidents per holiday',
    labels={'holiday': 'Holiday', 'value': 'Number of incidents', 'variable': 'Year'},
    barmode='group',
)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_holiday.html', auto_open=False);

We notice that the distribution of the number of incidents during each holiday remains consistent over the years. This consistency aligns with our expectations, given the similarity in the distribution of incidents across days throughout the years.

New Year's Eve, followed by the Independence day are the holiday with the highest number of incidents, while Christmas and Thanksgiving are the holidays with the lowest number of incidents.

Geospatial features: exploration and preparation¶

We check if the values of the attribute state are admissible comparing them with an official list of states:

In [ ]:
usa_states_df = pd.read_csv(
    'https://www2.census.gov/geo/docs/reference/state.txt',
    sep='|',
    dtype={'STATE': str, 'STATE_NAME': str}
)
usa_name_alphcode = usa_states_df.set_index('STATE_NAME').to_dict()['STUSAB']
states = incidents_df['state'].unique()
not_existing_states = False
missing_states = False

for state in states:
    if state not in usa_name_alphcode:
        not_existing_states = True
        print(f"State {state} does not exist")

for state in usa_name_alphcode:
    if state not in states:
        missing_states = True
        print(f"State {state} is missing")

if not_existing_states == False:
    print("All the values of the attribute 'states' are actually USA states (there are no misspelling or other errors).")
if missing_states == False:
    print("There is at least one incident for each USA state.")
State American Samoa is missing
State Guam is missing
State Northern Mariana Islands is missing
State Puerto Rico is missing
State U.S. Minor Outlying Islands is missing
State U.S. Virgin Islands is missing
All the values of the attribute 'states' are actually USA states (there are no misspelling or other errors).

We now check if, given a certain value for the attributes latitude and longitude, the attribute city_or_county has always the same value:

In [ ]:
incidents_df.groupby(['latitude', 'longitude'])['city_or_county'].unique()[lambda x: x.str.len() > 1].to_frame()
Out[ ]:
city_or_county
latitude longitude
19.7009 -155.0880 [Hilo, Hawaii (county)]
24.5611 -81.7732 [Key West, Key West (Stock Island)]
24.7186 -81.0668 [Marathon (Grassy Key), Marathon]
24.7250 -81.0470 [Marathon (Duck Key), Marathon]
25.1326 -80.4124 [Key Largo , Key Largo]
... ... ...
59.8040 -151.1450 [Homer, Homer (Fritz Creek)]
60.5844 -151.2350 [Nikiski, Kenai]
60.7854 -161.8010 [Bethel (Newtok), Newtok, Bethel (Napaskiak), Bethel]
61.5803 -149.6110 [Wasilla, Big Lake]
63.3547 -143.3610 [Tok, Tanacross, Tok (Mentasta Lake)]

1518 rows × 1 columns

That is not true and is due to the fact that sometimes the attribute city_or_county takes on the value of the city, other times the value of the county (as in the first row displayed above). Furthermore, we notice that even when the attribute refers to the same county it could be written in different ways (e.g. "Bethel (Newtok)", "Bethel (Napaskiak)", "Bethel").

We now check if a similar problem occurs for the attribute address:

In [ ]:
incidents_df.groupby(['latitude', 'longitude'])['address'].unique()[lambda x: x.str.len() > 1].to_frame()
Out[ ]:
address
latitude longitude
19.6950 -155.066 [2100 Kanoelehua Ave, 2100 Kanoelehua Avenue]
19.7052 -155.062 [2450 Kekuanaoa Street, 2450 Kekuanaoa St.]
19.7297 -155.090 [33rd Avenue, Kamehameha Highway, Kilauea Avenue extension]
21.2766 -157.687 [8102 Kalanianaole Highway, 8102 Kalanianaole Hwy]
21.3061 -157.861 [nan, 1 Keahole Place]
... ... ...
64.8580 -147.701 [423 Merhar Avenue, 425 Merhar Ave]
64.8599 -147.741 [1215 Bunnell St, 1215 Bunnell Street]
64.8621 -147.751 [College Road, 2600 College Rd]
64.8636 -147.759 [1800 College Road, 1800 College Rd]
64.8893 -147.759 [Farmer's Loop, Farmers Loop Road, Farmers Loop]

12315 rows × 1 columns

Still this attribute may be written in different ways (e.g. "Avenue" may also be written as "Ave", or "Highway" as "Hwy"). There could also be some errors (e.g. the same point corresponds to the address "33rd Avenue", "Kamehameha Highway" and "Kilauea Avenue extension").

We plot on a map the location of the incidents:

In [ ]:
fig = px.scatter_mapbox(
    lat=incidents_df['latitude'],
    lon=incidents_df['longitude'],
    zoom=0, 
    height=500,
    width=800
)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

There are some points in China that are clearly wrong. We display the rows of the dataset that correspond to one of these points:

In [ ]:
incidents_df[(incidents_df['latitude'] == 37.6499) & (incidents_df['longitude'] == 97.4331)]
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name
30351 2018-03-29 Kansas Wichita NaN 37.6499 97.4331 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN 0.0 NaN Non-Shooting Incident TSA Action 2018 2018-03-29 3 March 29 3 Thursday
102739 2018-03-25 Kansas Wichita NaN 37.6499 97.4331 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN 0.0 NaN Non-Shooting Incident TSA Action 2018 2018-03-25 3 March 25 6 Sunday

That point has probably the correct values for the attributes state and city_or_county.

To fix these inconsistencies we used the library GeoPy). This library allows to retrieve the address (state, county, suburb, city, town, village, location name, and other features) corresponding to a given latitude and longitude. We queried the library using all the latitudes and longitudes of the points in the dataset and we saved the results in the CSV file we now load:

In [ ]:
geopy_path = os.path.join(DATA_FOLDER_PATH, 'external_data/geopy.csv')
geopy_df = pd.read_csv(geopy_path, index_col=['index'], low_memory=False, dtype={})
geopy_df.sample(2, random_state=1)
Out[ ]:
importance_geopy addresstype_geopy address_geopy coord_presence city_geopy county_geopy state_geopy suburb_geopy village_geopy town_geopy
index
8141 0.000010 building 2706, South Lawndale Avenue, Little Village, South Lawndale, Chicago, West Chicago Township, Cook County, Illinois, 60623, United States True Chicago Cook County Illinois NaN NaN NaN
64095 0.166904 historic King David Kalakaua Building, 335, Merchant Street, Downtown, Hawaii Capital Historic District, Honolulu, Honolulu County, Hawaii, 96813, United States True Honolulu Honolulu County Hawaii Hawaii Capital Historic District NaN NaN

The rows in this dataframe correspond to the rows in the original dataset. Its column coord_presence is false if the corresponding row in the original dataset did not have latitude and longitude values.

Among all the attributes returned by GeoPy, we selected and used the following:

  • importance: Numerical value $\in [0,1]$, indicating the importance of the location (compared to other locations)
  • addresstype: Address type (e.g., "house", "street", "postcode")
  • state: State of the location
  • county: County of the location
  • suburb: Suburb of the location
  • city: City of the location
  • town: Town of the location
  • village: Village of the location
  • display_name: User-friendly representation of the location, often formatted as a complete address
In [ ]:
print(f"Number of rows in which surburb is null: {geopy_df.loc[geopy_df['suburb_geopy'].isna()].shape[0]}\n")
print('Coordinate presence:')
display(geopy_df['coord_presence'].value_counts())
print('Importance presence:')
display(geopy_df['importance_geopy'].notna().value_counts())
print(f"Number of rows in which city is null and town is not null: {geopy_df[(geopy_df['city_geopy'].isnull()) & (geopy_df['town_geopy'].notnull())].shape[0]}\n")
print("Values of addresstype:")
print(geopy_df['addresstype_geopy'].unique())
print(f"\nNumber of rows in which addresstype is null: {geopy_df[geopy_df['addresstype_geopy'].isnull()].shape[0]}")
Number of rows in which surburb is null: 213191

Coordinate presence:
True     231754
False      7923
Name: coord_presence, dtype: int64
Importance presence:
True     231754
False      7923
Name: importance_geopy, dtype: int64
Number of rows in which city is null and town is not null: 37534

Values of addresstype:
['road' 'place' 'building' 'amenity' nan 'aeroway' 'tourism' 'leisure'
 'county' 'shop' 'historic' 'office' 'man_made' 'craft' 'highway' 'city'
 'landuse' 'emergency' 'hamlet' 'railway' 'suburb' 'club' 'municipality'
 'junction' 'village' 'military' 'healthcare' 'locality' 'town'
 'neighbourhood' 'state' 'natural' 'city_district']

Number of rows in which addresstype is null: 7923

We also downloaded from Wikipedia) the list of the counties (or their equivalent) in each state.

This data was used in case incidents data did not match GeoPy data and when latitude and longitude where not available to check whether the county actually belonged to the state.

In [ ]:
counties_path = os.path.join(DATA_FOLDER_PATH, 'external_data/counties.csv')
counties_df = pd.read_csv(counties_path)
counties_df.head()
Out[ ]:
County or equivalent State or equivalent Population (2020 census) Area (mi^2) Core-based statistical area
0 Autauga Alabama 58,805 594.44 Montgomery, AL Metropolitan Statistical Area
1 Baldwin Alabama 231,767 1589.78 Daphne-Fairhope-Foley, AL Metropolitan Statistical Area
2 Barbour Alabama 25,223 884.88 Eufaula, AL-GA Micropolitan Statistical Area
3 Bibb Alabama 22,293 622.58 Birmingham-Hoover, AL Metropolitan Statistical Area
4 Blount Alabama 59,134 644.78 Birmingham-Hoover, AL Metropolitan Statistical Area

We now check and correct the consistency of the geographic data:

In [ ]:
from TASK_1.data_preparation_utils import check_geographical_data_consistency

if LOAD_DATA_FROM_CHECKPOINT:
    incidents_df = load_checkpoint('checkpoint_1', date_cols=['date', 'date_original'])
else:
    geo_df = incidents_df[['state', 'city_or_county', 'address', 'latitude', 'longitude']]
    geo_df = pd.concat([geo_df, geopy_df.loc[incidents_df.index]], axis=1)
    geo_df = geo_df.apply(lambda row: check_geographical_data_consistency(row, additional_data=counties_df), axis=1)
    incidents_df[geo_df.columns] = geo_df[geo_df.columns]
    save_checkpoint(incidents_df, 'checkpoint_1')

The function called above performs the following operations:

  • converts to lowercase the attributes state, county, and city
  • if city_or_county contains values for both city and county, splits them into two different fields
  • removes from city_or_county the words 'city of' and 'county'
  • removes from city_or_county punctuation and numerical values
  • removes frequent words from address and display_name (e.g., "Street," "Avenue," "Boulevard")

When latitude and longitude are available and therefore Geopy provided information for the corresponding location:

  • checks for equality between state and state_geopy
  • checks for equality between county and county_geopy or between county and suburb_geopy
  • checks for equality between city and city_geopy, or between city and town_geopy, or between city and village_geopy

If these comparisons fail, it checks for potential typos in the string. This is done using the Damerau-Levenshtein distance (defined below), with a threshold to decide the maximum distance for two strings to be considered equal. The thresholds were set after several preliminary tests. We decided to use different thresholds for state and city or county.

The Damerau-Levenshtein distance between two strings $s$ and $t$ is define as\ $D(i, j) = \min \begin{cases} D(i-1, j) + 1 \\ D(i, j-1) + 1 \\ D(i-1, j-1) + \delta \\ D(i-2, j-2) + \delta & \text{if } s[i] = t[j] \text{ and } s[i-1] = t[j-1] \end{cases}$

where:

  • $D(i, j)$ is the Damerau-Levenshtein distance between the first $i$ letters of a string $s$ and the first $j$ letters of a string $t$
  • $\delta$ is 0 if the current letters $s[i]$ and $t[j]$ are equal, otherwise, it is 1
  • $D(i-2, j-2) + \delta$ represents transposition (swapping two adjacent letters) if the current letters $s[i]$ and $t[j]$ are equal, and the preceding letters $s[i-1]$ and $t[j-1]$ are also equal

If the comparison still fails, it compares the address field from our dataset with GeoPy's display_name (using again the Damerau-Levenshtein distance).

In case of a match with GeoPy data, we set the values for the fields state, county, city, latitude, longitude, importance, address_type to the corresponding fileds provided by GeoPy. Otherwise we check for matches with the Wikipedia data (using again the Damerau-Levenshtein distance).

In [ ]:
tot_row = incidents_df.index.size
print('Number of rows with all null values: ', incidents_df.isnull().all(axis=1).sum(), ' / ', incidents_df.isnull().all(axis=1).sum()*100/tot_row, '%')
print('Number of rows with null value for state: ', incidents_df['state'].isnull().sum(), ' / ', incidents_df['state'].isnull().sum()*100/tot_row, '%')
print('Number of rows with null value for county: ', incidents_df['county'].isnull().sum(), ' / ', incidents_df['county'].isnull().sum()*100/tot_row, '%')
print('Number of rows with null value for city: ', incidents_df['city'].isnull().sum(), ' / ', incidents_df['city'].isnull().sum()*100/tot_row, '%')
print('Number of rows with null value for latitude: ', incidents_df['latitude'].isnull().sum(), ' / ', incidents_df['latitude'].isnull().sum()*100/tot_row, '%')
print('Number of rows with null value for longitude: ', incidents_df['longitude'].isnull().sum(), ' / ', incidents_df['longitude'].isnull().sum()*100/tot_row, '%')
Number of rows with all null values:  0  /  0.0 %
Number of rows with null value for state:  0  /  0.0 %
Number of rows with null value for county:  20914  /  8.736700072269729 %
Number of rows with null value for city:  35794  /  14.952732255275064 %
Number of rows with null value for latitude:  12796  /  5.345453482105931 %
Number of rows with null value for longitude:  12796  /  5.345453482105931 %
In [ ]:
sns.heatmap(incidents_df.isnull(), cbar=False, xticklabels=True);

Now all the entries in the dataset have at least the state value not null and consistent. Only 12,796 data points, which account for 5.34% of the dataset, were found to have inconsistent latitude and longitude values.

In [ ]:
incidents_df.groupby(['state_consistency','county_consistency','address_consistency']).size().to_frame().rename(columns={0:'count'}).sort_index(ascending=False)
Out[ ]:
count
state_consistency county_consistency address_consistency
1 1 1 181794
0 13201
-1 4956
-1 1 26621
0 2912
-1 1887
-1 1 1 13
0 4
-1 2
-1 1 53
0 11
-1 7927
In [ ]:
stats = {}
stats_columns = ['#null_val', '#not_null', '#value_count']
for col in ['state', 'county', 'city', 'latitude', 'longitude', 'state_consistency',
       'county_consistency', 'address_consistency', 'location_importance', 'address_type']:
    stats[col] = []
    stats[col].append(incidents_df[col].isna().sum())
    stats[col].append(len(incidents_df[col]) - incidents_df[col].isna().sum())
    stats[col].append(len(incidents_df[col].value_counts()))
    
clean_geo_stat_stats = pd.DataFrame(stats, index=stats_columns).transpose()
clean_geo_stat_stats
Out[ ]:
#null_val #not_null #value_count
state 0 239381 51
county 20914 218467 1955
city 35794 203587 8302
latitude 12796 226585 99826
longitude 12796 226585 110433
state_consistency 0 239381 2
county_consistency 0 239381 2
address_consistency 0 239381 3
location_importance 12796 226585 709
address_type 12796 226585 32
In [ ]:
incidents_df[['latitude', 'county', 'city']].isna().groupby(['latitude', 'county', 'city']).size().to_frame().rename(columns={0:'count'})
Out[ ]:
count
latitude county city
False False False 193710
True 22995
True False 9877
True 3
True False True 1762
True True 11034
In [ ]:
incidents_df[['latitude']].isna().groupby(['latitude']).size().to_frame().rename(columns={0:'count'})
Out[ ]:
count
latitude
False 226585
True 12796
In [ ]:
incidents_df_not_null = incidents_df[incidents_df['latitude'].notna()]
print('Number of entries with not null values for latitude and longitude: ', len(incidents_df_not_null))
plot_scattermap_plotly(incidents_df_not_null, 'state', zoom=2, title='Incidents distribution by state')
Number of entries with not null values for latitude and longitude:  226585
In [ ]:
incidents_df_nan_county = incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].isna()) & 
    (incidents_df['city'].notna())]
print('Number of entries with not null values for county but not for lat/lon and city: ', len(incidents_df_nan_county))
plot_scattermap_plotly(incidents_df_nan_county, 'state', zoom=2, title='Missing county')
Number of entries with not null values for county but not for lat/lon and city:  9877
In [ ]:
incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].isna()) & (incidents_df['city'].notna())].groupby('city').count()
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county state_consistency county_consistency address_consistency location_importance address_type
city
Alexandria 39 41 41 41 41 41 41 41 41 27 34 35 29 30 28 34 34 34 35 35 41 41 36 36 41 32 41 20 39 41 41 41 41 41 41 0 41 41 41 41 41
Aurora 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1
Baltimore 1567 1856 1856 1856 1856 1856 1797 1631 1797 1371 1792 1795 1437 1418 1427 1792 1792 1792 1795 1795 1856 1856 1821 1821 1856 591 1855 612 1567 1856 1856 1856 1856 1856 1856 0 1856 1856 1856 1856 1856
Brooklyn Park 35 36 36 36 36 36 36 36 36 31 35 34 31 32 31 35 35 35 34 34 36 36 35 35 36 8 36 11 35 36 36 36 36 36 36 0 36 36 36 36 36
Broomfield 6 7 7 7 7 7 7 7 7 3 6 6 4 4 3 6 6 6 6 6 7 7 7 7 7 2 7 5 6 7 7 7 7 7 7 0 7 7 7 7 7
Carson City 25 27 27 27 27 27 26 26 26 21 24 24 22 22 22 24 24 24 24 24 27 27 23 23 27 13 27 23 25 27 27 27 27 27 27 0 27 27 27 27 27
Charlottesville 57 58 58 58 58 58 58 58 58 38 44 45 43 40 41 44 44 44 45 45 58 58 46 46 58 56 58 28 57 58 58 58 58 58 58 0 58 58 58 58 58
Chesapeake 87 112 112 109 112 112 112 112 112 52 97 99 63 60 63 97 97 97 99 99 112 112 106 106 112 95 111 41 87 112 112 112 112 112 112 0 112 112 112 112 112
Colonial Heights 10 13 13 12 13 13 13 13 13 7 11 11 8 7 8 11 11 11 11 11 13 13 11 11 13 11 13 9 10 13 13 13 13 13 13 0 13 13 13 13 13
Covington 5 6 6 6 6 6 6 6 6 2 3 3 2 2 2 3 3 3 3 3 6 6 3 3 6 3 6 4 5 6 6 6 6 6 6 0 6 6 6 6 6
Danville 116 131 131 131 131 131 131 131 130 89 111 110 92 92 89 111 111 111 110 110 131 131 115 115 131 120 131 59 116 131 131 131 131 131 131 0 131 131 131 131 131
Denver 553 578 578 574 578 578 572 572 572 166 289 339 193 210 200 289 289 289 339 339 578 578 412 412 578 255 577 279 553 578 578 578 578 578 578 0 578 578 578 578 578
Falls Church 3 3 3 3 3 3 3 3 3 0 2 2 2 0 0 2 2 2 2 2 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 0 3 3 3 3 3
Franklin 13 15 15 15 15 15 15 15 15 9 12 13 9 10 10 12 12 12 13 13 15 15 14 14 15 12 15 6 13 15 15 15 15 15 15 0 15 15 15 15 15
Fredericksburg 29 30 30 30 30 30 30 30 30 18 19 19 18 18 18 19 19 19 19 19 30 30 19 19 30 28 30 12 29 30 30 30 30 30 30 0 30 30 30 30 30
Galax 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1
Hampton 194 208 208 208 208 208 205 205 205 144 183 184 154 154 153 183 183 183 184 184 208 208 194 194 208 182 207 89 194 208 208 208 208 208 208 0 208 208 208 208 208
Harrisonburg 12 12 12 11 12 12 12 12 12 7 9 8 8 8 8 9 9 9 8 8 12 12 10 10 12 9 12 9 12 12 12 12 12 12 12 0 12 12 12 12 12
Hopewell 42 46 46 46 46 46 45 45 45 34 43 44 37 37 38 43 43 43 44 44 46 46 46 46 46 39 46 18 42 46 46 46 46 46 46 0 46 46 46 46 46
Juneau 90 94 94 83 94 94 91 91 0 33 41 39 33 34 35 41 41 41 39 39 94 94 41 41 94 86 94 76 90 94 94 94 94 94 94 0 94 94 94 94 94
Lexington 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
Lynchburg 84 89 89 85 89 89 88 88 88 65 79 79 68 69 68 79 79 79 79 79 89 89 82 82 89 78 86 45 84 89 89 89 89 89 89 0 89 89 89 89 89
Manassas 16 17 17 17 17 17 17 17 17 9 11 10 10 9 9 11 11 11 10 10 17 17 11 11 17 12 17 7 16 17 17 17 17 17 17 0 17 17 17 17 17
Manassas Park 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
Martinsville 13 13 13 12 13 13 13 13 13 9 11 11 10 10 10 11 11 11 11 11 13 13 11 11 13 11 13 4 13 13 13 13 13 13 13 0 13 13 13 13 13
Norfolk 459 546 546 520 546 546 544 544 544 282 437 444 302 314 316 437 437 437 444 444 546 546 497 497 546 485 544 175 459 546 546 546 546 546 546 0 546 546 546 546 546
Norton 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 0 3 3 3 3 3
Philadelphia 110 119 119 74 119 119 118 3 3 74 90 90 75 76 76 90 90 90 90 90 119 119 94 94 119 73 119 53 110 119 119 119 119 119 119 0 119 119 119 119 119
Portsmouth 143 182 182 163 182 182 182 182 182 59 152 158 93 86 85 152 152 152 158 158 182 182 173 173 182 157 182 53 143 182 182 182 182 182 182 0 182 182 182 182 182
Richmond 236 303 303 278 303 303 301 301 301 118 269 273 154 159 161 269 269 269 273 273 303 303 286 286 303 272 300 99 236 303 303 303 303 303 303 0 303 303 303 303 303
Roanoke 67 73 73 64 73 73 73 73 73 46 57 59 48 47 47 57 57 57 59 59 73 73 62 62 73 69 72 40 67 73 73 73 73 73 73 0 73 73 73 73 73
Saint Louis 1994 2074 2074 2004 2074 2074 2056 2056 2056 1441 1767 1929 1536 1558 1544 1767 1767 1767 1929 1929 2074 2074 2013 2013 2074 1347 2072 961 1994 2074 2074 2074 2074 2074 2074 0 2074 2074 2074 2074 2074
Salem 5 5 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 3 5 5 5 5 5 5 5 5 5 0 5 5 5 5 5
San Francisco 318 378 378 364 378 378 377 377 377 220 331 326 240 245 239 331 331 331 326 326 378 378 353 353 378 199 378 183 318 378 378 378 378 378 378 0 378 378 378 378 378
Staunton 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 8 10 10 10 10 10 10 10 0 10 10 10 10 10
Suffolk 14 16 16 12 16 16 16 16 16 7 13 13 8 9 10 13 13 13 13 13 16 16 15 15 16 15 16 8 14 16 16 16 16 16 16 0 16 16 16 16 16
Virginia Beach 192 246 246 244 246 246 246 246 246 89 194 196 105 114 107 194 194 194 196 196 246 246 222 222 246 227 246 115 192 246 246 246 246 246 246 0 246 246 246 246 246
Washington 2002 2493 2493 2357 2493 2493 2129 0 0 462 1464 1315 580 589 569 1464 1464 1464 1315 1315 2493 2493 1706 1706 2493 573 2490 988 2002 2493 2493 2493 2493 2493 2493 0 2493 2493 2493 2493 2493
Waynesboro 24 25 25 24 25 25 25 25 25 21 23 23 21 22 21 23 23 23 23 23 25 25 23 23 25 25 25 19 24 25 25 25 25 25 25 0 25 25 25 25 25
Winchester 8 8 8 6 8 8 8 8 8 6 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 7 8 6 8 8 8 8 8 8 8 0 8 8 8 8 8
In [ ]:
incidents_df[(incidents_df['latitude'].notna()) & (incidents_df['city'].isna()) & (incidents_df['county'].isna())]
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type
39689 2014-06-23 MICHIGAN Traverse City 727 Fly Don't Dr 44.7881 -85.5539 1.0 104.0 37.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN 0.0 Cherry Capital Airport TSA Action NaN 2014.0 2014-06-23 6 June 23 0 Monday NaN NaN 1 -1 1 0.737851 state
68656 2014-01-22 MICHIGAN Traverse City 727 Fly Don't Dr 44.7881 -85.5539 1.0 104.0 37.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN 0.0 Cherry Capital Airport TSA Action NaN 2014.0 2014-01-22 1 January 22 2 Wednesday NaN NaN 1 -1 1 0.737851 state
204688 2015-07-08 MICHIGAN Traverse City 727 Fly Don't Dr 44.7881 -85.5539 1.0 104.0 37.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0 0 NaN NaN 0.0 NaN Non-Shooting Incident TSA Action 2015.0 2015-07-08 7 July 8 2 Wednesday NaN NaN 1 -1 1 0.737851 state
In [ ]:
incidents_df_nan_city = incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].notna()) & (incidents_df['city'].isna())]
print('Number of rows with null values for city, but not for lat/lon and county: ', len(incidents_df_nan_city))
plot_scattermap_plotly(incidents_df_nan_city, 'state', zoom=2, title='Missing city')
Number of rows with null values for city, but not for lat/lon and county:  22995

Final considerations

From this analysis we found that:

  • 174,796 entries are fully consistent
  • in 26,635 entries only the city is missing, but it can be inferred from latitude and longitude
  • in 15,000 entries only the county is missing, but it can be inferred from latitude and longitude
  • in 33 entries both the city and county are missing. Even in this group, the missing information can be inferred from latitude and longitude, as they are all closely clustered around Baltimore
  • in 3,116 entries latitude, longitude, and city are missing. They can be inferred (though not perfectly) from the county-state pair
  • in 19,844 entries only the state field is present

The dataset does not contain any other combinations beyond the ones just mentioned.

In the following we will recover the missing information using the latitude and longitude values.

City attribute¶

To recover missing values for the attribute city when latitude and longitude are available we will compute the closest city centroid and assign the corresponding city if the distance is below a certain threshold.

First we Compute the centroid for each city and visualize the first 10 centroids in lexicographic order:

In [ ]:
centroids = incidents_df.loc[incidents_df['latitude'].notna() & incidents_df['city'].notna()][[
    'latitude', 'longitude', 'city', 'state', 'county']].groupby(['state', 'county', 'city']).mean()
centroids.head(10)
Out[ ]:
latitude longitude
state county city
ALABAMA Autauga County Autaugaville 32.432700 -86.656433
Prattville 32.465774 -86.456941
Baldwin County Bay Minette 30.870900 -87.787700
Daphne 30.624720 -87.884390
Elberta 30.414333 -87.588567
Fairhope 30.520640 -87.880040
Foley 30.389367 -87.690533
Gulf Shores 30.279338 -87.696012
Loxley 30.670000 -87.758700
Orange Beach 30.293400 -87.589800
In [ ]:
print('Number of distinct cities:', len(centroids.index.to_list()))
Number of distinct cities: 11271

For each tuple <state, county, city> in 'centroids', we extract the corresponding latitude and longitude coordinates from the 'clean_geo_data' DataFrame. We then compute the distance between these coordinates and the centroids using the geodesic distance (in kilometers). We also compute percentiles (at 0.05 intervals), maximum, minimum, and average distances of the points in each city.

In [ ]:
info_city = pd.DataFrame(columns=['5', '15', '25', '35', '45', '55', '65', '75', '85', '95',
    'tot_points', 'min', 'max', 'avg', 'centroid_lat', 'centroid_lon'], index=centroids.index)
info_city.head(2)

if LOAD_DATA_FROM_CHECKPOINT: # load data
    info_city = load_checkpoint('checkpoint_cities')
else: # compute data
    for state, county, city in centroids.index:
        dummy = []
        for lat, long in zip(incidents_df.loc[(incidents_df['city'] == city) & 
            (incidents_df['state'] == state) & (incidents_df['county'] == county) & 
            incidents_df['latitude'].notna()]['latitude'], 
            incidents_df.loc[(incidents_df['city'] == city) & 
            (incidents_df['state'] == state) & (incidents_df['county'] == county) & 
            incidents_df['longitude'].notna()]['longitude']):
            dummy.append(geopy_distance.geodesic([lat, long], centroids.loc[state, county, city]).km)
            
        dummy = sorted(dummy)
        pc = np.quantile(dummy, np.arange(0, 1, 0.05))
        for i in range(len(info_city.columns) - 6):
            info_city.loc[state, county, city][i] = pc[i*2 + 1]
        info_city.loc[state, county, city][len(info_city.columns) - 6] = len(dummy)
        info_city.loc[state, county, city][len(info_city.columns) - 5] = min(dummy)
        info_city.loc[state, county, city][len(info_city.columns) - 4] = max(dummy)
        info_city.loc[state, county, city][len(info_city.columns) - 3] = sum(dummy)/len(dummy)
        info_city.loc[state, county, city][len(info_city.columns) - 2] = centroids.loc[state, county, city]['latitude']
        info_city.loc[state, county, city][len(info_city.columns) - 1] = centroids.loc[state, county, city]['longitude']
    save_checkpoint(info_city, 'checkpoint_cities') # save data 

We display the resulting dataframe and plot it on a map:

In [ ]:
info_city.head()
Out[ ]:
county city 5 15 25 35 45 55 65 75 85 95 tot_points min max avg centroid_lat centroid_lon
state
ALABAMA Autauga County Autaugaville 0.274213 0.289707 0.305201 0.320695 0.336188 0.370072 0.422346 0.474619 0.526893 0.579166 3 0.266466 0.605303 0.405235 32.432700 -86.656433
ALABAMA Autauga County Prattville 0.352777 1.233816 1.487384 2.595130 2.728048 3.498539 3.823870 3.912885 4.881287 6.568004 27 0.325572 8.573536 3.135990 32.465774 -86.456941
ALABAMA Baldwin County Bay Minette 0.649590 0.851911 0.851911 0.858401 0.998652 1.141183 1.274665 1.507917 2.596292 5.745650 9 0.514709 7.663826 1.936234 30.870900 -87.787700
ALABAMA Baldwin County Daphne 0.881853 1.510370 2.681879 2.700294 2.746789 3.010000 3.450319 4.034984 4.257822 5.261581 10 0.881853 6.059492 3.096890 30.624720 -87.884390
ALABAMA Baldwin County Elberta 0.320503 0.641146 0.961789 1.282432 1.603075 1.777587 1.805968 1.834349 1.862730 1.891110 3 0.160181 1.905301 1.276293 30.414333 -87.588567
In [ ]:
info_city.loc[info_city['tot_points'] > 1].info()
<class 'pandas.core.frame.DataFrame'>
Index: 6868 entries, ALABAMA to WYOMING
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   county        6868 non-null   object 
 1   city          6868 non-null   object 
 2   5             6868 non-null   float64
 3   15            6868 non-null   float64
 4   25            6868 non-null   float64
 5   35            6868 non-null   float64
 6   45            6868 non-null   float64
 7   55            6868 non-null   float64
 8   65            6868 non-null   float64
 9   75            6868 non-null   float64
 10  85            6868 non-null   float64
 11  95            6868 non-null   float64
 12  tot_points    6868 non-null   int64  
 13  min           6868 non-null   float64
 14  max           6868 non-null   float64
 15  avg           6868 non-null   float64
 16  centroid_lat  6868 non-null   float64
 17  centroid_lon  6868 non-null   float64
dtypes: float64(15), int64(1), object(2)
memory usage: 1019.5+ KB
In [ ]:
fig = plot_scattermap_plotly(
    info_city,
    size='tot_points',
    x_column='centroid_lat',
    y_column='centroid_lon',
    hover_name=False, zoom=2,
    title='Number of incidents per city'
)
fig.show()
pyo.plot(fig, filename='../html/incidents_per_city.html', auto_open=False);

We utilize the previously calculated data to infer missing values for the city field in entries of the dataset where latitude and longitude are available. The city field is assigned if the distance of the entry from the centroid falls within the third quartile of all points assigned to that centroid.

In [ ]:
def substitute_city(row, info_city):
    if pd.isna(row['city']) and not np.isnan(row['latitude']):
        for state, county, city in info_city.index:
            if row['state'] == state and row['county'] == county:
                if info_city.loc[state, county, city]['tot_points'] > 1:
                    max_radius = info_city.loc[state, county, city]['75'] # terzo quantile
                    centroid_coord = [info_city.loc[state, county, city]['centroid_lat'], 
                        info_city.loc[state, county, city]['centroid_lon']]
                    if (geopy_distance.geodesic([row['latitude'], row['longitude']], centroid_coord).km <= 
                        max_radius):
                        row['city'] = city
                        break
                    
    return row
In [ ]:
if LOAD_DATA_FROM_CHECKPOINT:
    new_incidents_df = load_checkpoint('checkpoint_2', date_cols=['date', 'date_original'])
else:
    new_incidents_df = incidents_df.apply(lambda row: substitute_city(row, info_city), axis=1)
    save_checkpoint(incidents_df, 'checkpoint_2')
In [ ]:
n_nan_cities = incidents_df['city'].isnull().sum()
n_non_nan_cities = new_incidents_df['city'].isnull().sum()
print('Number of rows with null values for city before the inference: ', n_nan_cities)
print('Number of rows with null values for city after the inference: ', n_non_nan_cities)
print(f'Number of city recovered: {n_nan_cities - n_non_nan_cities}')
Number of rows with null values for city before the inference:  35794
Number of rows with null values for city after the inference:  33545
Number of city recovered: 2249
In [ ]:
incidents_df_nan_city = new_incidents_df.loc[(incidents_df['latitude'].notna()) & (incidents_df['county'].notna()) & (incidents_df['city'].isna())]
print('Number of rows with null values for city, but not for lat/lon and county: ', len(incidents_df_nan_city))
plot_scattermap_plotly(incidents_df_nan_city, 'city', zoom=2, title='City inferred')
Number of rows with null values for city, but not for lat/lon and county:  22995
In [ ]:
incidents_df = new_incidents_df
incidents_df[['latitude', 'county', 'city']].isna().groupby(['latitude', 'county', 'city']).size().to_frame().rename(columns={0:'count'})
Out[ ]:
count
latitude county city
False False False 195959
True 20746
True False 9877
True 3
True False True 1762
True True 11034

Districts attributes¶

We check if the attribute congressional_district is numbered consistently (with '0' for states with only one congressional district). To do so we use the dataset containing the data about elections in the period of interest (congressional districts are redrawn when (year%10)==0):

In [ ]:
at_large_states = elections_df[
    (elections_df['year'].between(2013, 2018, inclusive="both")) & 
    (elections_df['congressional_district']==0)
    ]['state'].unique()
at_large_states
Out[ ]:
array(['ALASKA', 'DELAWARE', 'DISTRICT OF COLUMBIA', 'MONTANA',
       'NORTH DAKOTA', 'SOUTH DAKOTA', 'VERMONT', 'WYOMING'], dtype=object)

Now we check if states with a '0' as congressional district are the same states with only one congressional district in the dataset containing the data about elections:

In [ ]:
zero_congress_states_inc = incidents_df[incidents_df['congressional_district']==0]['state'].unique()
set(zero_congress_states_inc).issubset(set(at_large_states))
Out[ ]:
True

We check if states with a single congressional district are always numbered with '0' in the dataset containing the data about elections:

In [ ]:
incidents_df[(incidents_df['state'] == at_large_states.any()) & (incidents_df['congressional_district']!=0)].size==0
Out[ ]:
False

Since they are not, we fix this issue:

In [ ]:
incidents_df.loc[incidents_df['state'].isin(at_large_states), 'congressional_district'] = 0

We check if the range of the attribute congressional_district is consistent with the number of congressional districts in the dataset containing the data about elections:

In [ ]:
incidents_df['state'] = incidents_df['state'].str.upper()
wrong_congr_states = elections_df.groupby('state')['congressional_district'].max()>=incidents_df.groupby('state')['congressional_district'].max()
for state in wrong_congr_states[wrong_congr_states==False].index:
    print(f"State {state} has more districts in the incidents data than in the elections data")
State KENTUCKY has more districts in the incidents data than in the elections data
State OREGON has more districts in the incidents data than in the elections data
State WEST VIRGINIA has more districts in the incidents data than in the elections data

We display the rows with inconsistent congressional district in Kentucky:

In [ ]:
incidents_df[
    (incidents_df['state']=='KENTUCKY') &
    (incidents_df['congressional_district'] > 
        elections_df[(elections_df['state']=='KENTUCKY') & (elections_df['year']>2012)]['congressional_district'].max())
]
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type
69605 2014-07-30 KENTUCKY Lexington Woodland Avenue near Main Street NaN NaN 8.0 79.0 10.0 23.0 Adult 18+ Male 23.0 23.0 23.0 0.0 0.0 2.0 2.0 0.0 0 0 0.0 2.0 2.0 man robbed at gunpoint Armed robbery with injury/death and/or evidence of DGU found NaN 2014.0 2014-07-30 7 July 30 2 Wednesday NaN NaN -1 -1 1 NaN NaN

Searching online we found that Kentucky, in that period, had 6 congressional districts, so we'll set to nan the congressional district for the row above:

In [ ]:
incidents_df.loc[
    (incidents_df['state']=='KENTUCKY') &
    (incidents_df['congressional_district'] > 
        elections_df[(elections_df['state']=='KENTUCKY') & (elections_df['year']>2012)]['congressional_district'].max()),
    'congressional_district'] = np.nan

We display the rows with inconsistent congressional district in Oregon:

In [ ]:
incidents_df[
    (incidents_df['state']=='OREGON') &
    (incidents_df['congressional_district'] > 
        elections_df[(elections_df['state']=='OREGON') & (elections_df['year']>2012)]['congressional_district'].max())
]
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type
205517 2015-06-24 OREGON Dayton NaN NaN NaN 10.0 39.0 5.0 43.0 Adult 18+ Male 43.0 43.0 43.0 0.0 0.0 1.0 1.0 0.0 0 0 1.0 0.0 1.0 NaN Non-Shooting Incident Drug involvement 2015.0 2015-06-24 6 June 24 2 Wednesday NaN NaN -1 1 0 NaN NaN
216127 2014-02-06 OREGON Marysville NaN NaN NaN 9.0 66.0 45.0 27.0 Adult 18+ Male 27.0 27.0 27.0 0.0 0.0 1.0 1.0 0.0 0 0 0.0 1.0 1.0 NaN Armed robbery with injury/death and/or evidence of DGU found NaN 2014.0 2014-02-06 2 February 6 3 Thursday NaN NaN -1 1 0 NaN NaN

Searching online we found that Oregon, in that period, had 5 congressional districts, so we'll set to nan the congressional district for the rows above:

In [ ]:
incidents_df.loc[
    (incidents_df['state']=='OREGON') &
    (incidents_df['congressional_district'] > 
        elections_df[(elections_df['state']=='OREGON') & (elections_df['year']>2012)]['congressional_district'].max()),
    'congressional_district'] = np.nan 

We display the rows with inconsistent congressional district in West Virginia:

In [ ]:
incidents_df[
    (incidents_df['state']=='WEST VIRGINIA') &
    (incidents_df['congressional_district'] > 
        elections_df[(elections_df['state']=='WEST VIRGINIA') & (elections_df['year']>2012)]['congressional_district'].max())
]
Out[ ]:
date state city_or_county address latitude longitude congressional_district state_house_district state_senate_district participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type
221378 2017-05-30 WEST VIRGINIA Hartford NaN NaN NaN 6.0 94.0 30.0 52.0 Adult 18+ Male 32.0 42.0 52.0 0.0 0.0 2.0 2.0 0.0 0 1 0.0 1.0 2.0 Clay Co, father shot son with same name, son threat with shotgun, father shot with 22 cal revolver Shot - Wounded/Injured Defensive Use 2017.0 2017-05-30 5 May 30 1 Tuesday NaN NaN -1 -1 0 NaN NaN

Searching online we found that West Virginia, in that period, had 3 congressional districts, so we'll set to nan the congressional district for the row above:

In [ ]:
incidents_df.loc[
    (incidents_df['state']=='WEST VIRGINIA') &
    (incidents_df['congressional_district'] > 
        elections_df[(elections_df['state']=='WEST VIRGINIA') & (elections_df['year']>2012)]['congressional_district'].max()),
    'congressional_district'] = np.nan

We check whether given a certain value for the attributes latitude and longitude, the attribute congressional_district has always the same value:

In [ ]:
incidents_df[incidents_df['congressional_district'].notnull()].groupby(['latitude', 'longitude'])['congressional_district'].unique()[lambda x: x.str.len() > 1].to_frame().rename(columns={0:'count'}).sample(5, random_state=1)
Out[ ]:
congressional_district
latitude longitude
36.6732 -76.9284 [3.0, 4.0]
37.5478 -77.4219 [4.0, 3.0]
30.4153 -84.2972 [2.0, 5.0]
37.0438 -76.3586 [2.0, 3.0]
36.1162 -79.4353 [4.0, 6.0]

All these points are probably errors, due to the fact that they are near the border between two congressional districts. We correct them setting the most frequent value for the attribute congressional_district (setting that value also for the entries with missing values):

In [ ]:
corrected_congr_districts = incidents_df[
    ~incidents_df['congressional_district'].isna()
    ].groupby(['latitude', 'longitude'])['congressional_district'].agg(lambda x: x.value_counts().index[0])
incidents_df = incidents_df.merge(corrected_congr_districts, on=['latitude', 'longitude'], how='left')
# where latitude and longitude are null, keep the original value
incidents_df['congressional_district_y'].fillna(incidents_df['congressional_district_x'], inplace=True)
incidents_df.rename(columns={'congressional_district_y':'congressional_district'}, inplace=True)
incidents_df.drop(columns=['congressional_district_x'], inplace=True)

In the same city or county there could be different values for the attribute congressional_district (this is not an error, is actually possible according to the USA law):

In [ ]:
incidents_df[incidents_df['congressional_district'].notna()].groupby(['state', 'city_or_county'])['congressional_district'].unique()[lambda x: x.str.len() > 1].to_frame()
Out[ ]:
congressional_district
state city_or_county
ALABAMA Adamsville [6.0, 7.0]
Auburn [3.0, 1.0]
Bessemer [7.0, 6.0]
Bessemer (Hueytown) [7.0, 6.0]
Birmingham [7.0, 6.0, 4.0]
... ... ...
WISCONSIN Nekoosa [7.0, 3.0]
Platteville [2.0, 3.0]
Richfield [7.0, 5.0]
Rock (county) [1.0, 2.0]
Wisconsin Dells [6.0, 3.0, 2.0]

1598 rows × 1 columns

We print the unique values the attribute state_house_district can take on:

In [ ]:
house_districts = incidents_df['state_house_district'].unique()
house_districts.sort()
house_districts
Out[ ]:
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,
        23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,  33.,
        34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,  44.,
        45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,  55.,
        56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,  66.,
        67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,  77.,
        78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,  88.,
        89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,  99.,
       100., 101., 102., 103., 104., 105., 106., 107., 108., 109., 110.,
       111., 112., 113., 114., 115., 116., 117., 118., 119., 120., 121.,
       122., 123., 124., 125., 126., 127., 128., 129., 130., 131., 132.,
       133., 134., 135., 136., 137., 138., 139., 140., 141., 142., 143.,
       144., 145., 146., 147., 148., 149., 150., 151., 152., 153., 154.,
       155., 156., 157., 158., 159., 160., 161., 162., 163., 164., 165.,
       166., 167., 168., 169., 170., 171., 172., 173., 174., 175., 176.,
       177., 178., 179., 180., 181., 182., 183., 184., 185., 186., 187.,
       188., 189., 190., 191., 192., 193., 194., 195., 196., 197., 198.,
       199., 200., 201., 202., 203., 204., 205., 206., 207., 208., 209.,
       210., 211., 212., 214., 215., 216., 217., 218., 219., 303., 411.,
       413., 503., 506., 510., 511., 512., 513., 514., 515., 516., 517.,
       518., 521., 524., 529., 530., 533., 537., 601., 602., 603., 605.,
       608., 610., 614., 615., 618., 622., 623., 624., 705., 708., 709.,
       713., 714., 716., 717., 718., 720., 721., 723., 724., 725., 726.,
       727., 729., 801., 804., 805., 808., 809., 811., 813., 814., 901.,
        nan])

Also this attribute has some errors because the maximum number of state house districts should be 204 (for New Hampshire, see here). For now we won't correct this error beacuse this attribute is not useful for our analysis.

We check if given a certain value for the attributes latitude and longitude, the attribute state_house_district has always the same value:

In [ ]:
incidents_df[incidents_df['state_house_district'].notnull()].groupby(
    ['latitude', 'longitude'])['state_house_district'].unique()[lambda x: x.str.len() > 1].to_frame()
Out[ ]:
state_house_district
latitude longitude
25.7789 -80.1980 [113.0, 109.0]
27.7805 -97.3983 [34.0, 32.0]
29.6807 -95.4252 [134.0, 146.0]
29.7364 -95.5887 [137.0, 133.0]
29.9462 -90.1198 [91.0, 98.0]
... ... ...
44.8293 -69.7514 [111.0, 86.0]
45.5044 -122.5030 [51.0, 47.0]
45.5263 -122.5680 [45.0, 46.0]
45.5740 -122.6380 [43.0, 44.0]
61.2228 -149.7920 [17.0, 14.0]

174 rows × 1 columns

We correct the errors:

In [ ]:
corrected_house_districts = incidents_df[
    incidents_df['state_house_district'].notnull()
    ].groupby(['latitude', 'longitude'])['state_house_district'].agg(lambda x: x.value_counts().index[0])
incidents_df = incidents_df.merge(corrected_house_districts, on=['latitude', 'longitude'], how='left')
incidents_df['state_house_district_y'].fillna(incidents_df['state_house_district_x'], inplace=True)
incidents_df.rename(columns={'state_house_district_y':'state_house_district'}, inplace=True)
incidents_df.drop(columns=['state_house_district_x'], inplace=True)

We now print the unique values the attribute state_senate_district can take on:

In [ ]:
senate_districts = incidents_df['state_senate_district'].unique()
senate_districts.sort()
senate_districts
Out[ ]:
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
       14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
       27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39.,
       40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., 51., 52.,
       53., 54., 55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65.,
       66., 67., 94., nan])

And again we notice some errors because the maximum number of state senate districts should be 67 (for Minnesota, see here). For now we won't correct this error beacuse this attribute is not useful for our analysis.

We correct other possible errors as above:

In [ ]:
corrected_senate_districts = incidents_df[
    incidents_df['state_senate_district'].notnull()
    ].groupby(['latitude', 'longitude'])['state_senate_district'].agg(lambda x: x.value_counts().index[0])
incidents_df = incidents_df.merge(corrected_senate_districts, on=['latitude', 'longitude'], how='left')
incidents_df['state_senate_district_y'].fillna(incidents_df['state_senate_district_x'], inplace=True)
incidents_df.rename(columns={'state_senate_district_y':'state_senate_district'}, inplace=True)
incidents_df.drop(columns=['state_senate_district_x'], inplace=True)

We check whether given a state, city_or_county and state_senate_district, the value of the attribute congressional_district is always the same:

In [ ]:
incidents_df[incidents_df['congressional_district'].notnull()].groupby(
    ['state', 'city_or_county', 'state_senate_district'])['congressional_district'].unique()[lambda x: x.str.len() > 1].shape[0]==0
Out[ ]:
False

Hence we cannot recover the missing values for the attribute congressional_district from the values of state_senate_district. We check the same for the attribute state_house_district:

In [ ]:
incidents_df[incidents_df['congressional_district'].notnull()].groupby(
    ['state', 'city_or_county', 'state_house_district'])['congressional_district'].unique()[lambda x: x.str.len() > 1].shape[0]==0
Out[ ]:
False

We cannot recover the missing values for the attribute congressional_district from the values of state_house_district either.

We could, instead, recover the missing values from the entries with "similar" latitude and longitude. To explore this possibility we first plot on a map the dislocation of the incidents, coloring them according to the value of the attribute congressional_district:

In [ ]:
plot_scattermap_plotly(
    incidents_df,
    'congressional_district',
    black_nan=True,
    zoom=2,
    height=800,
    width=800,
    title="USA Congressional districts"
    )

Many points with missing congressional_district are often "surrounded" by points belonging to the same congressional district. We could, therefore, use KNN classifier to recover those values.

We'll do this first for the state of Alabama, showing the results with some plots. Later we will do the same for all the other states.

We plot the distribution of the values of the attribute congressional_district for the state of Alabama:

In [ ]:
plot_scattermap_plotly(
    incidents_df[incidents_df['state']=='ALABAMA'],
    attribute='congressional_district',
    black_nan=True,
    width=500,
    height=600,
    zoom=5.5,
    title="Alabama incidents by Congressional Districts",
    legend_title="Congressional District"
)

We define a function to prepare the data for the classification task:

In [ ]:
def build_X_y_for_district_inference(incidents_df):
    X_train = np.concatenate((
        incidents_df[
            (incidents_df['congressional_district'].notna()) &
            (incidents_df['latitude'].notna()) & 
            (incidents_df['longitude'].notna())
            ]['latitude'].values.reshape(-1, 1),
        incidents_df[
            (incidents_df['congressional_district'].notna()) & 
            (incidents_df['latitude'].notna()) & 
            (incidents_df['longitude'].notna())
            ]['longitude'].values.reshape(-1, 1)),
        axis=1
    )
    X_test = np.concatenate((
        incidents_df[
            (incidents_df['congressional_district'].isna()) & 
            (incidents_df['latitude'].notna()) & 
            (incidents_df['longitude'].notna())
            ]['latitude'].values.reshape(-1, 1),
        incidents_df[
            (incidents_df['congressional_district'].isna()) &
            (incidents_df['latitude'].notna()) & 
            (incidents_df['longitude'].notna())
            ]['longitude'].values.reshape(-1, 1)),
        axis=1
    )
    y_train = incidents_df[
        (incidents_df['congressional_district'].notna()) & 
        (incidents_df['latitude'].notna()) & 
        (incidents_df['longitude'].notna())
        ]['congressional_district'].values
    return X_train, X_test, y_train

We define the function to compute the geodesic distance to pass to the KNN classifier:

In [ ]:
def geodesic_distance(point1, point2):
    return geopy_distance.geodesic(point1, point2).km

Now we are ready to apply the classifier (using K=1):

In [ ]:
X_train, X_test, y_train = build_X_y_for_district_inference(incidents_df[incidents_df['state']=="ALABAMA"])
knn_clf = KNeighborsClassifier(n_neighbors=1, metric=geodesic_distance)
knn_clf.fit(X_train, y_train)
knn_pred = knn_clf.predict(X_test)
incidents_df['KNN_congressional_district'] = incidents_df['congressional_district']
incidents_df.loc[
    (incidents_df['state']=="ALABAMA") &
    (incidents_df['congressional_district'].isna()) &
    (incidents_df['latitude'].notna()) & 
    (incidents_df['longitude'].notna()),
    'KNN_congressional_district'
    ] = knn_pred

We plot the results:

In [ ]:
plot_scattermap_plotly(
    incidents_df[incidents_df['state']=='ALABAMA'],
    attribute='KNN_congressional_district',
    width=500,
    height=600,
    zoom=5.5,
    title="Alabama incidents by Congressional Districts",
    legend_title="Congressional District"
)

To improve the visualization, we plot on the map the decision boundaries of the classifier. To do so, we convert latitude and longitude to a 2D space:

In [ ]:
transformer = Transformer.from_crs("EPSG:4326", "EPSG:26929", always_xy=True) # EPSG:26929 identifies the projected coordinate system for Alabama East (had to choose between E,W,N,S)

X_train_converted = []

for i in range(X_train.shape[0]):
    x, y = transformer.transform(X_train[i][1], X_train[i][0])
    X_train_converted.append([x,y])

X_train_converted = np.array(X_train_converted)

And now we train the classifier using the euclidean distance:

In [ ]:
knn_eu_clf = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
knn_eu_clf.fit(X_train_converted, y_train)
Out[ ]:
KNeighborsClassifier(metric='euclidean', n_neighbors=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(metric='euclidean', n_neighbors=1)

We plot the boundaries of the classifier:

In [ ]:
alabama_color_map = {
    1:'red',
    2:'orange',
    3:'yellow',
    4:'green',
    5:'lightblue',
    6:'blue',
    7:'purple'
}
plot_clf_decision_boundary(knn_eu_clf, X_train_converted, y_train, alabama_color_map, "KNN Alabama borders")

We can now compare the boundaries built by the classifier with the actual boundaries (this map was taken here):

Alt Text

The result is satisfactory. However, it is important to highlight that if there are no examples available for a specific district, we won't assign the correct label to the points in that districts. We check how many congressional districts have 2 or less examples:

In [ ]:
incidents_df.groupby(['state', 'congressional_district']).size()[lambda x: x <= 2]
Out[ ]:
state       congressional_district
NEW JERSEY  14.0                      1
dtype: int64

By the way, missclassification can still occurr, depending on the position of the available examples w.r.t the position of the points to classify. Aware of this limitation, we proceed to apply this method to the other states and plot the result:

In [ ]:
if LOAD_DATA_FROM_CHECKPOINT:
    incidents_df = load_checkpoint('checkpoint_3', date_cols=['date', 'date_original'])
else:
    for state in incidents_df['state'].unique():
        if state != "ALABAMA":
            print(f"{state} done.")
            X_train, X_test, y_train = build_X_y_for_district_inference(incidents_df[incidents_df['state']==state])
            if X_test.shape[0] == 0:
                continue
            knn_clf.fit(X_train, y_train)
            knn_pred = knn_clf.predict(X_test)
            incidents_df.loc[
                (incidents_df['state']==state) &
                (incidents_df['congressional_district'].isna()) &
                (incidents_df['latitude'].notna()) & 
                (incidents_df['longitude'].notna()),
                'KNN_congressional_district'
            ] = knn_pred
    incidents_df.drop(columns=['congressional_district'], inplace=True)
    incidents_df.rename(columns={'KNN_congressional_district':'congressional_district'}, inplace=True)
    save_checkpoint(incidents_df, 'checkpoint_3')

plot_scattermap_plotly(
    incidents_df,
    'congressional_district',
    zoom=2,
    height=800,
    width=800,
    title="USA Congressional districts (after inference)"
)

We now plot on a map the location of the incidents, coloring them according to the value of the attribute state_senate_district and state_house_district, to assess wheter we can apply the same method to recover missing values:

In [ ]:
plot_scattermap_plotly(
    incidents_df,
    'state_senate_district',
    black_nan=True,
    zoom=2,
    height=800,
    width=800,
    title="USA State senate districts"
    )

plot_scattermap_plotly(
    incidents_df,
    'state_house_district',
    black_nan=True,
    zoom=2,
    height=800,
    width=800,
    title="USA State house districts"
    )

These attributes have a lot of missing values, sometimes spread over large areas where there are no other points. Given this scarcity of training examples, we cannot apply the same method to recover the missing values.

Finally we visualize the most frequent addresses:

In [ ]:
incidents_df.groupby(['address']).size().sort_values(ascending=False)[:50].plot(
    kind='bar',
    figsize=(10,6),
    title='Counts of the addresses with the 50 highest number of incidents'
);

Many of the most frequent addresses are located in airports.

Age, gender and number of participants: exploration and preparation¶

We display a concise summary of the attributes related to age, gender and number of participants:

In [ ]:
participants_columns = ['participant_age1', 'participant_age_group1', 'participant_gender1', 
    'min_age_participants', 'avg_age_participants', 'max_age_participants',
    'n_participants_child', 'n_participants_teen', 'n_participants_adult', 
    'n_males', 'n_females',
    'n_killed', 'n_injured', 'n_arrested', 'n_unharmed', 
    'n_participants']
age_df = incidents_df[participants_columns]
age_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 239381 entries, 0 to 239380
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   participant_age1        147360 non-null  float64
 1   participant_age_group1  197508 non-null  object 
 2   participant_gender1     203266 non-null  object 
 3   min_age_participants    159107 non-null  float64
 4   avg_age_participants    159149 non-null  float64
 5   max_age_participants    159065 non-null  float64
 6   n_participants_child    197518 non-null  float64
 7   n_participants_teen     197521 non-null  float64
 8   n_participants_adult    197522 non-null  float64
 9   n_males                 203266 non-null  float64
 10  n_females               203266 non-null  float64
 11  n_killed                239381 non-null  int64  
 12  n_injured               239381 non-null  int64  
 13  n_arrested              211988 non-null  float64
 14  n_unharmed              211988 non-null  float64
 15  n_participants          239381 non-null  float64
dtypes: float64(12), int64(2), object(2)
memory usage: 31.0+ MB
In [ ]:
age_df['participant_age_group1'].unique()
Out[ ]:
array(['Adult 18+', nan, 'Teen 12-17', 'Child 0-11'], dtype=object)

We check if we have entries with non-null values for participant_age1 but NaN for participant_age_group1.

In [ ]:
age_df[age_df['participant_age1'].notna() & age_df['participant_age_group1'].isna()]
Out[ ]:
participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants
1279 29.0 NaN Male 29.0 29.0 29.0 NaN NaN NaN 1.0 0.0 0 0 NaN NaN 2.0
3251 34.0 NaN Male 34.0 34.0 34.0 NaN NaN NaN 1.0 0.0 0 0 NaN NaN 1.0
3480 20.0 NaN Female 20.0 20.0 20.0 NaN NaN NaN 0.0 1.0 0 1 0.0 0.0 1.0
4035 23.0 NaN Male 23.0 23.0 23.0 NaN NaN NaN 1.0 0.0 0 1 0.0 0.0 1.0
6445 18.0 NaN Male 18.0 18.0 18.0 NaN NaN NaN 1.0 0.0 0 1 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
229593 34.0 NaN Male 34.0 44.0 55.0 -252.0 NaN -355.0 2.0 0.0 1 0 1.0 0.0 2.0
233599 25.0 NaN Male 25.0 25.0 25.0 NaN NaN NaN 1.0 0.0 0 0 NaN NaN 1.0
235505 48.0 NaN Male 48.0 48.0 48.0 NaN NaN NaN 1.0 0.0 0 1 0.0 0.0 1.0
236181 25.0 NaN Male 25.0 25.0 25.0 NaN NaN NaN 1.0 0.0 0 1 0.0 0.0 1.0
238254 33.0 NaN Male 31.0 32.0 33.0 NaN NaN NaN 1.0 2.0 0 0 0.0 3.0 4.0

196 rows × 16 columns

These 126 values can be inferred.

Below, we provide a brief summary of the operations we performed to correct missing and inconsistent values.

First of all, we converted all the consistent values to integers. All the out of range values (e.g. nagative values or improbable ages) were set to NaN. We considered the maximum possible age to be 122 years, as it is the age reached by Jeanne Louise Calment, the world's oldest person.

To identify inconsistencies in the data related to the minimum, maximum, average age of participants, and to the composition of the age groups we checked if:

  • min_age_participants $<$ avg_age_participants $<$ max_age_participants
  • n_participants_child $+$ n_participants_teen $+$ n_participants_adult $>$ 0

  • $if$ min_age_participants $<$ 12 $then$ n_participants_child $>$ 0

  • $if$ 12 $\leq$ min_age_participants $<$ 18 $then$ n_participants_teen $>$ 0
  • $if$ min_age_participants $\geq$ 18 $then$ n_participants_adult $>$ 0

  • $if$ max_age_participants $<$ 12 $then$ n_participants_child $>$ 0 and n_participants_teen $=$ 0 and n_participants_adult $=$ 0

  • $if$ max_age_participants $<$ 18 $then$ n_participants_teen $>$ 0 or n_participants_child $>$ 0 and n_participants_adult $=$ 0
  • $if$ max_age_participants $\geq$ 18 $then$ n_participants_adult $>$ 0

Note that: child = 0-11, teen = 12-17, adult = 18+

To identify inconsistencies in the data related to the number of participants we checked if:

  • n_participants $\geq$ 0
  • n_participants $==$ n_males $+$ n_females
  • n_killed $+$ n_injured $\leq$ n_participants
  • n_arrested $\leq$ n_participants
  • n_unharmed $\leq$ n_participants

Values related to participant_age_group1 and participant_gender1 have been binarized using one-hot encoding, thus creating the boolean features participant1_child, participant1_teen, participant1_adult, participant1_male, participant1_female.

To identify other potential inconsistencies we did the following checks:

  • $if$ participant_age1 $<$ 12 $then$ participant_age_group1 $=$ Child
  • $if$ 12 $\leq$ participant_age1 $<$ 18 $then$ participant_age_group1 $=$ Teen
  • $if$ participant_age1 $\geq$ 18 $then$ participant_age_group1 $==$ Adult

  • $if$ participant_age_group1 $==$ Child $then$ n_participants_child $>$ 0

  • $if$ participant_age_group1 $==$ Teen $then$ n_participants_teen $>$ 0
  • $if$ participant_age_group1 $==$ Adult $then$ n_participants_adult $>$ 0

  • $if$ participant_gender1 $==$ Male $then$ n_males $>$ 0

  • $if$ participant_gender1 $==$ Female $then$ n_females $>$ 0

We kept track of data consistency by using the following variables (the variables were set to True if data were consistent, False if they were not, or NaN when data was missing):

  • consistency_age: to track the consistency between the minimum, maximum, and average ages and the number of participants by age groups
  • consistency_n_participant: to track the consistency between the number of participants
  • consistency_gender: to track the consistency of the gender attribute
  • consistency_participant1: to track the consistency of the attributes related to participant1
  • participant1_age_consistency_wrt_all_data: to track the consistency between the age of participant1 and the other attributes related to ages
  • participant1_age_range_consistency_wrt_all_data: to track the consistency between the age group of participant1 and the other attributes related to age groups
  • participant1_gender_consistency_wrt_all_data: to track the consistency between the gender of participant1 and the other attributes related to gender
  • consistency_participants1_wrt_n_participants: to track the overall consistency between the data about participant1 and the other data

The chunck of code below performes the specified checks:

In [ ]:
from TASK_1.data_preparation_utils import check_age_gender_data_consistency

if LOAD_DATA_FROM_CHECKPOINT: # load data
    age_temporary_df = load_checkpoint('checkpoint_tmp')#, date_cols=['date', 'date_original'])
else: # compute data
    age_temporary_df = age_df.apply(lambda row: check_age_gender_data_consistency(row), axis=1)
    save_checkpoint(age_temporary_df, 'checkpoint_tmp') # save data

In the following, we will display and visualize statistics about the data consistency.

In [ ]:
age_temporary_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 239381 entries, 0 to 239380
Data columns (total 28 columns):
 #   Column                                           Non-Null Count   Dtype  
---  ------                                           --------------   -----  
 0   participant_age1                                 147358 non-null  float64
 1   participant1_child                               197704 non-null  object 
 2   participant1_teen                                197704 non-null  object 
 3   participant1_adult                               197704 non-null  object 
 4   participant1_male                                203265 non-null  object 
 5   participant1_female                              203265 non-null  object 
 6   min_age_participants                             147362 non-null  float64
 7   avg_age_participants                             147358 non-null  float64
 8   max_age_participants                             147359 non-null  float64
 9   n_participants_child                             197514 non-null  float64
 10  n_participants_teen                              197511 non-null  float64
 11  n_participants_adult                             197514 non-null  float64
 12  n_males                                          203266 non-null  float64
 13  n_females                                        203266 non-null  float64
 14  n_killed                                         239381 non-null  int64  
 15  n_injured                                        239381 non-null  int64  
 16  n_arrested                                       211988 non-null  float64
 17  n_unharmed                                       211988 non-null  float64
 18  n_participants                                   214751 non-null  float64
 19  consistency_age                                  239381 non-null  bool   
 20  consistency_n_participant                        211988 non-null  object 
 21  consistency_gender                               203266 non-null  object 
 22  consistency_participant1                         147358 non-null  object 
 23  consistency_participants1_wrt_n_participants     203285 non-null  object 
 24  participant1_age_consistency_wrt_all_data        147358 non-null  object 
 25  participant1_age_range_consistency_wrt_all_data  197704 non-null  object 
 26  participant1_gender_consistency_wrt_all_data     203265 non-null  object 
 27  nan_values                                       239381 non-null  bool   
dtypes: bool(2), float64(12), int64(2), object(12)
memory usage: 49.8+ MB
In [ ]:
age_temporary_df[['consistency_age', 'consistency_n_participant', 'consistency_gender', 
    'consistency_participant1', 'consistency_participants1_wrt_n_participants',
    'participant1_age_consistency_wrt_all_data', 'participant1_age_range_consistency_wrt_all_data',
    'participant1_gender_consistency_wrt_all_data', 'nan_values']].describe()
Out[ ]:
consistency_age consistency_n_participant consistency_gender consistency_participant1 consistency_participants1_wrt_n_participants participant1_age_consistency_wrt_all_data participant1_age_range_consistency_wrt_all_data participant1_gender_consistency_wrt_all_data nan_values
count 239381 211988 203266 147358 203285 147358 197704 203265 239381
unique 2 1 2 2 2 1 2 1 2
top True True True True True True True True False
freq 234328 211988 187103 146259 203095 147358 197514 203265 145602
In [ ]:
print('Number of rows with null values: ', age_temporary_df[age_temporary_df['nan_values'] == True].shape[0])
print('Number of rows with inconsistent values in age data: ', age_temporary_df[age_temporary_df['consistency_age'] == False].shape[0])
print('Number of rows with inconsistent values in number of participants data: ', age_temporary_df[age_temporary_df[
    'consistency_n_participant'] == False].shape[0])
print('Number of rows with inconsistent values in gender data: ', age_temporary_df[age_temporary_df['consistency_gender'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 data: ', age_temporary_df[age_temporary_df[
    'consistency_participant1'] == False].shape[0])
print('Number of rows with inconsistent values for participants1: ', age_temporary_df[age_temporary_df[
    'consistency_participant1'] == False].shape[0])
print('Number of rows with NaN values for participants1: ', age_temporary_df[age_temporary_df[
    'consistency_participant1'] == np.nan].shape[0])
print('Number of rows with inconsistent values in participants1 wrt all other data: ', age_temporary_df[age_temporary_df[
    'consistency_participants1_wrt_n_participants'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 wrt age data: ', age_temporary_df[age_temporary_df[
    'participant1_age_consistency_wrt_all_data'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 wrt age range data: ', age_temporary_df[age_temporary_df[
    'participant1_age_range_consistency_wrt_all_data'] == False].shape[0])
print('Number of rows with inconsistent values in participants1 wrt gender data: ', age_temporary_df[age_temporary_df[
    'participant1_gender_consistency_wrt_all_data'] == False].shape[0])
print('Number of rows with null values in age data: ', age_temporary_df[age_temporary_df['consistency_age'].isna()].shape[0])
print('Number of rows with null values in number of participants data: ', age_temporary_df[age_temporary_df[
    'consistency_n_participant'].isna()].shape[0])
print('Number of rows with null values in gender data: ', age_temporary_df[age_temporary_df['consistency_gender'].isna()].shape[0])
print('Number of rows with null values in participants1 data: ', age_temporary_df[age_temporary_df[
    'consistency_participant1'].isna()].shape[0])
print('Number of rows with all null data: ', age_temporary_df.isnull().all(axis=1).sum())
Number of rows with null values:  93779
Number of rows with inconsistent values in age data:  5053
Number of rows with inconsistent values in number of participants data:  0
Number of rows with inconsistent values in gender data:  16163
Number of rows with inconsistent values in participants1 data:  1099
Number of rows with inconsistent values for participants1:  1099
Number of rows with NaN values for participants1:  0
Number of rows with inconsistent values in participants1 wrt all other data:  190
Number of rows with inconsistent values in participants1 wrt age data:  0
Number of rows with inconsistent values in participants1 wrt age range data:  190
Number of rows with inconsistent values in participants1 wrt gender data:  0
Number of rows with null values in age data:  0
Number of rows with null values in number of participants data:  27393
Number of rows with null values in gender data:  36115
Number of rows with null values in participants1 data:  92023
Number of rows with all null data:  0

We notice that:

  • The data in our dataset related to participant1, excluding the 1099 cases where age and age group data were inconsistent with each other and 190 cases where age range is not consistent, always appear to be consistent with the data in the rest of the dataset and can thus be used to fill in missing values or correct data
  • In the data related to age and gender, some inconsistencies are present, but they account for only 1.88% and 6.01% of the total dataset rows, respectively
  • In 93779 rows, at least one field had a NaN value

We plot the age distribution of participant1 and compare it to the distribution of the minimum and maximum participants' age for each group:

In [ ]:
fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(20, 6), sharey=True)

ax0.hist(age_temporary_df['participant_age1'], bins=100, edgecolor='black', linewidth=0.8)
ax0.set_xlabel('Age')
ax0.set_ylabel('Frequency')
ax0.set_title('Distribution of age participant1')

ax1.hist(age_temporary_df['min_age_participants'], bins=100, edgecolor='black', linewidth=0.8)
ax1.set_xlabel('Age')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of min age participants')

ax2.hist(age_temporary_df['max_age_participants'], bins=100, edgecolor='black', linewidth=0.8)
ax2.set_xlabel('Age')
ax2.set_ylabel('Frequency')
ax2.set_title('Distribution of max age participants')

plt.show()

The similar shapes of the distributions provides confirmation that the data pertaining to participant1 is accurate and reliable. Therefore, we can confidently use participant1's data to fill gaps in cases incidents involved a single participant.

We visualize the number of unique values for the cardinality of participants in each incident and provided a brief summary of this feature:

In [ ]:
print('Values of n_participants: ', age_temporary_df['n_participants'].unique())
age_temporary_df['n_participants'].describe().to_frame()
Values of n_participants:  [  1.   2.   5.   3.   4.  nan   6.   7.   8.  10.   9.  12.  17.  14.
  11.  28.  24.  15.  37.  16.  19.  13.  35.  18.  29.  20.  21.  32.
  27.  52.  22.  26.  47.  30.  23.  63. 103.]
Out[ ]:
n_participants
count 214751.000000
mean 1.826520
std 1.185155
min 1.000000
25% 1.000000
50% 2.000000
75% 2.000000
max 103.000000

We visualize the distribution of the number of participants for each incident using a log scale:

In [ ]:
plt.figure(figsize=(20, 5))
plt.bar(incidents_df.groupby('n_participants')['n_participants'].count().index, incidents_df.groupby('n_participants')['n_participants'].count(),
    alpha=0.8, edgecolor='black', linewidth=0.8)
plt.yscale('log')
plt.xlabel('Number of participants for incidents')
plt.ylabel('Number of incidents')
plt.plot([0.5, 103.5], [1, 1], '--', color='magenta', label='1 incident')
plt.plot([0.5, 103.5], [2, 2], '--', color='red', label='2 incidents')
plt.plot([0.5, 103.5], [10, 10], '--', color='green', label='10 incidents')
plt.plot([0.5, 103.5], [100, 100], '--', color='blue', label='100 incidents')
plt.xticks(range(1, 104, 2), range(1, 104, 2))
plt.legend()
plt.show()

We will now correct inconsistencies in the following manner:

  • when having the number of males (n_males) and number of females (n_females), we set the total number of participants as n_participants = n_males + n_females
  • when having a single participant and consistent data for participants1, we use that data to set the attributes related to age (max, min, average) and gender
In [ ]:
from TASK_1.data_preparation_utils import set_gender_age_consistent_data

if LOAD_DATA_FROM_CHECKPOINT:
    with zipfile.ZipFile('checkpoints/checkpoint_4.csv.zip', 'r') as zip_ref:
        zip_ref.extractall('checkpoints/')
    incidents_df = load_checkpoint('checkpoint_4', date_cols=['date', 'date_original'])
else:
    new_age_df = age_temporary_df.apply(lambda row: set_gender_age_consistent_data(row), axis=1)
    incidents_df[new_age_df.columns] = new_age_df[new_age_df.columns]
    save_checkpoint(incidents_df, 'checkpoint_4')

We visualize the data after the corrections:

In [ ]:
incidents_df.sample(2, random_state=1)
Out[ ]:
date state city_or_county address latitude longitude participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type state_house_district state_senate_district congressional_district participant1_child participant1_teen participant1_adult participant1_male participant1_female
143491 2016-05-23 FLORIDA Port Charlotte Conway Boulevard 26.9829 -82.0881 18.0 Adult 18+ Male 18.0 19.0 20.0 0.0 0.0 3.0 3.0 0.0 0.0 0.0 2.0 1.0 3.0 shot at bicyclist from blue vehicle Shots Fired - No Injuries Drive-by (car to street, car to car) 2016.0 2016-05-23 5 May 23 0 Monday Charlotte County Port Charlotte 1 1 1 1.000000e-07 place 75.0 28.0 17.0 False False True True False
110344 2017-02-11 PENNSYLVANIA Harrisburg 2400 block of Berryhill St 40.2645 -76.8475 24.0 Adult 18+ Male 24.0 24.0 24.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 NaN Shot - Dead (murder, accidental, suicide) Suicide^ 2017.0 2017-02-11 2 February 11 5 Saturday Dauphin County Harrisburg 1 1 1 1.000000e-07 place NaN NaN 4.0 False False True True False
In [ ]:
incidents_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 239381 entries, 0 to 239380
Data columns (total 47 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   date                       216373 non-null  datetime64[ns]
 1   state                      239381 non-null  object        
 2   city_or_county             239381 non-null  object        
 3   address                    222934 non-null  object        
 4   latitude                   226585 non-null  float64       
 5   longitude                  226585 non-null  float64       
 6   participant_age1           146259 non-null  float64       
 7   participant_age_group1     197508 non-null  object        
 8   participant_gender1        203266 non-null  object        
 9   min_age_participants       144857 non-null  float64       
 10  avg_age_participants       144855 non-null  float64       
 11  max_age_participants       144856 non-null  float64       
 12  n_participants_child       197948 non-null  float64       
 13  n_participants_teen        197948 non-null  float64       
 14  n_participants_adult       197948 non-null  float64       
 15  n_males                    198137 non-null  float64       
 16  n_females                  198137 non-null  float64       
 17  n_killed                   239381 non-null  float64       
 18  n_injured                  239381 non-null  float64       
 19  n_arrested                 211988 non-null  float64       
 20  n_unharmed                 211988 non-null  float64       
 21  n_participants             214751 non-null  float64       
 22  notes                      158558 non-null  object        
 23  incident_characteristics1  239055 non-null  object        
 24  incident_characteristics2  141775 non-null  object        
 25  year                       216373 non-null  float64       
 26  date_original              239381 non-null  datetime64[ns]
 27  month                      239381 non-null  int64         
 28  month_name                 239381 non-null  object        
 29  day                        239381 non-null  int64         
 30  day_of_week                239381 non-null  int64         
 31  day_of_week_name           239381 non-null  object        
 32  county                     218467 non-null  object        
 33  city                       205836 non-null  object        
 34  state_consistency          239381 non-null  int64         
 35  county_consistency         239381 non-null  int64         
 36  address_consistency        239381 non-null  int64         
 37  location_importance        226585 non-null  float64       
 38  address_type               226585 non-null  object        
 39  state_house_district       201845 non-null  float64       
 40  state_senate_district      208065 non-null  float64       
 41  congressional_district     231699 non-null  float64       
 42  participant1_child         196415 non-null  object        
 43  participant1_teen          196415 non-null  object        
 44  participant1_adult         196415 non-null  object        
 45  participant1_male          202178 non-null  object        
 46  participant1_female        202178 non-null  object        
dtypes: datetime64[ns](2), float64(21), int64(6), object(18)
memory usage: 87.7+ MB
In [ ]:
print('Number of rows in which all data are null: ', incidents_df.isnull().all(axis=1).sum())
print('Number of rows with some null data: ', incidents_df.isnull().any(axis=1).sum())
print('Number of rows in which number of participants is null: ', incidents_df[incidents_df['n_participants'].isnull()].shape[0])
print('Number of rows in which number of participants is 0: ', incidents_df[incidents_df['n_participants'] == 0].shape[0])
print('Number of rows in which number of participants is null and n_killed is not null: ', incidents_df[
    incidents_df['n_participants'].isnull() & incidents_df['n_killed'].notnull()].shape[0])
print('Total rows with null value for n_participants: ', incidents_df['n_participants'].isnull().sum())
print('Total rows with null value for n_participants_child: ', incidents_df['n_participants_child'].isnull().sum())
print('Total rows with null value for n_participants_teen: ', incidents_df['n_participants_teen'].isnull().sum())
print('Total rows with null value for n_participants_adult: ', incidents_df['n_participants_adult'].isnull().sum())
print('Total rows with null value for n_males: ', incidents_df['n_males'].isnull().sum())
print('Total rows with null value for n_females: ', incidents_df['n_females'].isnull().sum())
Number of rows in which all data are null:  0
Number of rows with some null data:  196882
Number of rows in which number of participants is null:  24630
Number of rows in which number of participants is 0:  0
Number of rows in which number of participants is null and n_killed is not null:  24630
Total rows with null value for n_participants:  24630
Total rows with null value for n_participants_child:  41433
Total rows with null value for n_participants_teen:  41433
Total rows with null value for n_participants_adult:  41433
Total rows with null value for n_males:  41244
Total rows with null value for n_females:  41244
In [ ]:
sns.heatmap(incidents_df.isnull(), cbar=False)
Out[ ]:
<Axes: >

We recovered all the data related to age and gender. In 98973 entries, at most a value is missing.

We now explore the distribution of the total number of participants and the number of participants per age group. Once again we use a logaritmic scale in the y-axis:

In [ ]:
def plot_hist(df_column, n_bin=100, density=True, title=None, y_label=None, color=None, y_logscale=False):
    
    def iqr_fence(x):
        q1 = x.quantile(0.25)
        med = x.quantile(0.5)
        q3 = x.quantile(0.75)
        IQR = q3 - q1
        u = x.max()
        l = x.min()
        Lower_Fence = builtins.max(builtins.min(x[x > q1 - (1.5 * IQR)], default=pd.Timestamp.min), l)
        #Lower_Fence = builtins.max(q1 - (1.5 * IQR), l)
        Upper_Fence = builtins.min(builtins.max(x[x < q3 + (1.5 * IQR)], default=pd.Timestamp.max), u)
        #Upper_Fence = builtins.min(q3 + (1.5 * IQR), u)
        return [Lower_Fence, q1, med, q3, Upper_Fence]
    relevant_positions = iqr_fence(df_column)
    n_items = len(df_column.index)

    fig, axs = plt.subplots(3, sharex=True, figsize=(14, 6))
    fig.suptitle(title)

    # fixed bin
    axs[0].hist(df_column, bins=n_bin, density=density, color=color)
    axs[0].set_ylabel(str(n_bin) + ' bin')
    axs[0].grid(axis='y')
    if y_logscale:
        axs[0].set_yscale('log')

    # number of bins computed using Sturge's rule
    n_bin = int(1 + math.log2(n_items))
    axs[1].hist(df_column, bins=n_bin, density=density, color=color)
    axs[1].set_ylabel("Sturge\'s rule binning")
    if y_logscale:
        axs[1].set_yscale('log')
    axs[1].grid(axis='y')

    axs[2].boxplot(x=df_column.dropna().values, labels=[''], vert=False)
    axs[2].set_xlabel(y_label)

    for i in range(2):
        axs[i].axvline(x = relevant_positions[0], color = 'black', linestyle = '--', alpha=0.75)
        axs[i].axvline(x = relevant_positions[1], color = 'black', linestyle = '-.', alpha=0.75)
        axs[i].axvline(x = relevant_positions[2], color = 'black', linestyle = '-.', alpha=0.75)
        axs[i].axvline(x = relevant_positions[3], color = 'black', linestyle = '-.', alpha=0.75)
        axs[i].axvline(x = relevant_positions[4], color = 'black', linestyle = '--', alpha=0.75)
    
    return fig
In [ ]:
plot_hist(incidents_df['n_participants'], title='Distribution of number of participants', n_bin=104, y_label='n_participants', density=False, y_logscale=True);
In [ ]:
plt.figure(figsize=(20, 5))
plt.hist(incidents_df['n_participants'], bins=104, edgecolor='black', linewidth=0.8)
plt.xlabel('Number of participants')
plt.ylabel('Frequency (log scale)')
plt.xticks(np.arange(1, 104, 2))
plt.yscale('log')
plt.title('Distribution of number of participants')
plt.show()
In [ ]:
incidents_df[['n_participants', 'n_participants_child', 'n_participants_teen', 'n_participants_adult']].max().to_frame().rename(columns={0:'max value'})
Out[ ]:
max value
n_participants 103.0
n_participants_child 7.0
n_participants_teen 27.0
n_participants_adult 103.0
In [ ]:
incidents_df[incidents_df['n_participants_adult'] > 60][['n_participants', 'n_participants_adult', 
    'n_participants_child', 'n_participants_teen']]
Out[ ]:
n_participants n_participants_adult n_participants_child n_participants_teen
179866 63.0 63.0 0.0 0.0
235978 103.0 103.0 0.0 0.0
In [ ]:
fig, (ax0, ax1, ax2) = plt.subplots(3, 1, figsize=(20, 12), sharex=True, sharey=True)

ax0.bar(incidents_df['n_participants_child'].value_counts().index, incidents_df['n_participants_child'].value_counts(),
    alpha=0.8, color='magenta', edgecolor='black', linewidth=0.8, label='Children')
ax0.legend()
ax1.bar(incidents_df['n_participants_teen'].value_counts().index, incidents_df['n_participants_teen'].value_counts(),
    alpha=0.8, color='red', edgecolor='black', linewidth=0.8, label='Teen')
ax1.legend()
ax2.bar(incidents_df['n_participants_adult'].value_counts().index, incidents_df['n_participants_adult'].value_counts(),
    color='orange', edgecolor='black', linewidth=0.8, label='Adult')
ax2.legend()

plt.xlim(-1, 64)
plt.xticks(range(0, 64))
plt.yscale('log')
plt.xlabel('Number of participants')
ax0.set_ylabel('Number of incidents')
ax1.set_ylabel('Numer of incidents')
ax2.set_ylabel('Numer of incidents')
ax0.set_title('Number of participants for each incident per age')
plt.show()

We observe that in incidents involving children and teenagers under the age of 18, the total number of participants is smaller than 7 and 27, respectively. In general, incidents involving a single person are much more frequent than other incidents, and most often, they involve teenagers and children, with a smaller percentage involving adults. On the other hand, incidents with more than one participant mostly consist of adults, and as the number of participants increases, the frequency of such incidents decreases.

We also plot the distribution of the number of incidents per gender:

In [ ]:
plt.figure(figsize=(20, 5))
plt.bar(incidents_df['n_males'].value_counts().index-0.2, incidents_df['n_males'].value_counts(), 0.4,
    edgecolor='black', linewidth=0.8, label='Males participants')
plt.bar(incidents_df['n_females'].value_counts().index+0.2, incidents_df['n_females'].value_counts(), 0.4,
    edgecolor='black', linewidth=0.8, label='Females participants')
plt.xticks(range(0, 64))
plt.yscale('log')
plt.xlabel('Number of participants')
plt.ylabel('Number of incidents')
plt.legend()
plt.title('Number of participants for each incident per gender')
plt.show()

Below, we plot the distribution of the average age of participants in each incident.

In [ ]:
plot_hist(incidents_df['avg_age_participants'], y_label='avg_age_participants', density=False);

Incident characteristics features: exploration and preparation¶

We use a word cloud to display the most frequent words in the attribut notes:

In [ ]:
nltk.download('stopwords')
stopwords_set = set(stopwords.words('english'))

word_cloud_all_train = WordCloud(
    width=800,
    height=400,
    stopwords=stopwords_set,
    collocations=False,
    background_color='white'
    ).generate(' '.join(incidents_df[incidents_df['notes'].notna()]['notes'].tolist()));
word_cloud_all_train.to_svg()
plt.figure( figsize=(20,10) )
plt.imshow(word_cloud_all_train)
plt.axis('off');
plt.title('Word cloud of notes\n', fontsize=40, fontweight='bold');
plt.savefig("../html/word_cloud_notes.svg")
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/irenetesta/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

We check if given the first characteristic of a record, the second one is different

In [ ]:
incidents_df[incidents_df['incident_characteristics1']==incidents_df['incident_characteristics2']].shape[0]==0
Out[ ]:
True

We plot the frequency of each characteristic:

In [ ]:
# merge characteristics list
ch1_counts = incidents_df['incident_characteristics1'].value_counts()
ch2_counts = incidents_df['incident_characteristics2'].value_counts()
ch_counts = ch1_counts.add(ch2_counts, fill_value=0).sort_values(ascending=True)
ch_counts.to_frame()
Out[ ]:
0
Police Targeted 1.0
Defensive Use - Crime occurs, victim shoots subject/suspect/perpetrator 1.0
Cleaning gun 1.0
Child with gun - no shots fired 1.0
Officer Involved Shooting - subject/suspect/perpetrator unarmed 1.0
... ...
Officer Involved Incident 14571.0
Shots Fired - No Injuries 35700.0
Non-Shooting Incident 44665.0
Shot - Dead (murder, accidental, suicide) 53396.0
Shot - Wounded/Injured 93910.0

91 rows × 1 columns

In [ ]:
fig = ch_counts.plot(kind='barh', figsize=(5, 18))
fig.set_xscale("log")
plt.title("Counts of 'incident_characteristics'")
plt.xlabel('Count')
plt.ylabel('Incident characteristics')
plt.tight_layout()
/var/folders/8n/f3j9y6bd3zd_kp83y0vvn_b80000gn/T/ipykernel_5016/2459086947.py:6: UserWarning:

Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.

In [ ]:
ch1_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics1'].value_counts()
ch2_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics2'].value_counts()
ch_females_counts = ch1_females_counts.add(ch2_females_counts, fill_value=0).sort_values(ascending=False).plot(
    kind='bar',
    title='Characteristics counts of incidents with females involved',
    figsize=(20,10)
)
In [ ]:
characteristics_count_matrix = pd.crosstab(incidents_df['incident_characteristics2'], incidents_df['incident_characteristics1'])
fig, ax = plt.subplots(figsize=(25, 20))
sns.heatmap(characteristics_count_matrix, cmap='coolwarm', ax=ax, xticklabels=True, yticklabels=True, linewidths=.5)
ax.set_xlabel('incident_characteristics1')
ax.set_ylabel('incident_characteristics2')  
ax.set_title('Counts of incident characteristics')
plt.tight_layout()
In [ ]:
characteristics_count_matrix[["Shot - Dead (murder, accidental, suicide)"]].sort_values(
    by="Shot - Dead (murder, accidental, suicide)",
    inplace=False,
    ascending=False).plot(
        kind='bar',
        figsize=(20,10)
    );

We defined the following binary tags to categorize the characteristics of each incident:

  • firearm: it tells if firearms were involved in the incident
  • air_gun: it tells if air guns were involved in the incident
  • shots: it tells if the incident involved shots
  • aggression: it tells if there was an aggression (both using a gun or not)
  • suicide: it tells if the incident involved a suicide (attempts are included)
  • injuries: it tells if one ore more subjects got injured
  • death: it tells if one ore more subjects died
  • road: it tells if the incident happened in a road
  • illegal_holding: it tells if the incident involved a stealing act or if a gun was illegaly possessed
  • house: it tells if the incident happened in a house
  • school: it tells if the incident happened next to a school
  • children: it tells if the incident involved one or more children
  • drugs: it tells if the incident involved drugs
  • officers: it tells if one or more officiers were involved in the incident
  • organized: it tells if the incident was planned by an organization or a group
  • social_reasons: it tells if the incident involved social discriminations or terrorism
  • defensive: it tells if there was a defensive use of a gun during the incident
  • workplace: it tells if the incident happened in a workplace
  • abduction: it tells if the incident involved any form of abduction
  • unintentional: it tells if the incident was unintentional

Each tag was set to True if and only if we had enough information to assume that the incident had that particular characteristic.

We set all the tags and check their consistency w.r.t. the other data:

In [ ]:
from TASK_1.data_preparation_utils import add_tags, check_tag_consistency, check_characteristics_consistency, IncidentTag

tags_columns = [tag.name for tag in IncidentTag]
tags_columns.append('tag_consistency')

if LOAD_DATA_FROM_CHECKPOINT:
    with zipfile.ZipFile('checkpoints/checkpoint_5.csv.zip', 'r') as zip_ref:
        zip_ref.extractall('checkpoints/')
    incidents_df = load_checkpoint('checkpoint_5', date_cols=['date', 'date_original'])
else:
    incidents_df = add_tags(incidents_df)
    incidents_df['tag_consistency'] = True
    incidents_df = incidents_df.apply(lambda row: check_tag_consistency(row), axis=1)
    incidents_df = incidents_df.apply(lambda row: check_characteristics_consistency(row), axis=1)
    save_checkpoint(incidents_df, 'checkpoint_5')
In [ ]:
incidents_df['tag_consistency'].value_counts().to_frame()
Out[ ]:
tag_consistency
True 221461
False 17920

We correct the inconsistencies in the tag assuming the numerical data are consistent and we save again the dataset.

In [ ]:
from TASK_1.data_preparation_utils import set_tags_consistent_data

if LOAD_DATA_FROM_CHECKPOINT:
    with zipfile.ZipFile('checkpoints/checkpoint_6.csv.zip', 'r') as zip_ref:
        zip_ref.extractall('checkpoints/')
    incidents_df = load_checkpoint('checkpoint_6', date_cols=['date', 'date_original'])
else:
    for index, record in incidents_df.iterrows():
        if record['tag_consistency'] == False:
            incidents_df.loc[index] = set_tags_consistent_data(record)
            incidents_df.loc[index, 'tag_consistency'] = True # set to true and then check again if it's still not consistent
    incidents_df = incidents_df.apply(lambda row: check_tag_consistency(row), axis=1)
    incidents_df = incidents_df.apply(lambda row: check_characteristics_consistency(row), axis=1)
    save_checkpoint(incidents_df, 'checkpoint_6')
In [ ]:
pd.DataFrame(data=incidents_df.dtypes).T
Out[ ]:
date state city_or_county address latitude longitude participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type state_house_district state_senate_district congressional_district participant1_child participant1_teen participant1_adult participant1_male participant1_female firearm air_gun shots aggression suicide injuries death road illegal_holding house school children drugs officers organized social_reasons defensive workplace abduction unintentional tag_consistency
0 datetime64[ns] object object object float64 float64 float64 object object float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 object object object float64 datetime64[ns] int64 object int64 int64 object object object int64 int64 int64 float64 object float64 float64 float64 object object object object object bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool bool
In [ ]:
incidents_df['tag_consistency'].value_counts().to_frame()
Out[ ]:
tag_consistency
True 221461
False 17920

We display the frequencies of the tags grouping them:

In [ ]:
tags_counts = {}
tags_counts['Murder'] = incidents_df[
    (incidents_df['death']==True) &
    ((incidents_df['aggression']==True) |
    (incidents_df['social_reasons']==True))].shape[0] # not accidental nor defensive
tags_counts['Suicide'] = incidents_df[
    (incidents_df['death']==True) &
    (incidents_df['suicide']==True)].shape[0] # warninig: if murder/suicide is counted twice
tags_counts['Defensive'] = incidents_df[
    (incidents_df['death']==True) &
    (incidents_df['defensive']==True)].shape[0]
tags_counts['Accidental'] = incidents_df[
    (incidents_df['death']==True) &
    (incidents_df['unintentional']==True)].shape[0]
tags_counts['Others or not known'] = incidents_df[
    (incidents_df['death']==True) &
    (incidents_df['aggression']==False) &
    (incidents_df['social_reasons']==False) &
    (incidents_df['unintentional']==False)].shape[0]

fig, ax = plt.subplots()
total = sum(tags_counts.values())
ax.pie(tags_counts.values())
legend_labels = [f'{label}: {(size/total)*100:.1f}%' for label, size in tags_counts.items()]
plt.legend(legend_labels)
plt.title("Gun incidents")
plt.show()
fig.savefig("../html/pie_incident_type.svg")

Most of the incidents involved Murder. Suicide, Defensive and Accidental are very few compare to murders. The other big slice of the pie belongs to 'Others', showing that there are a lot of different incidents that are less common.

We show the frequency of the values of each singular tag:

In [ ]:
ax = (incidents_df[tags_columns].apply(lambda col: col.value_counts()).T.sort_values(by=True)/incidents_df.shape[0]*100).plot(kind='barh', stacked=True, alpha=0.8, edgecolor='black')
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%', label_type='center', fontsize=8)
plt.title("Incidents characteristic (%)")
Out[ ]:
Text(0.5, 1.0, 'Incidents characteristic (%)')

The most common tags are firearm, shots, aggression and injuries (above 50% of the records), in particular firearm is True for almost every record (97.8 %). On the other hand there are tags (air_gun, school, social_reasons and abduction) that are very rare.

We check for correlations between accidental incidents and the presence of children.

In [ ]:
incidents_df['unintentional'].corr(incidents_df['n_participants_child']>0)
Out[ ]:
0.15398486295938882

The two events are not correlated.

We display the most common characteristics for incidents involving women.

In [ ]:
ch1_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics1'].value_counts()
ch2_females_counts = incidents_df[incidents_df['n_females']>1]['incident_characteristics2'].value_counts()
ch_females_counts = ch1_females_counts.add(ch2_females_counts, fill_value=0).sort_values(ascending=False).plot(
    kind='bar',
    title='Characteristics counts of incidents with females involved',
    figsize=(20,10)
)

The distribution is very similar to the one involving both men and women. Some of the main differences are that, for women, the frequency of suicides is higher, while the involvemnte of officiers is lower.

We plot on a map the location of mass shootings, incidents involving children and suicides:

In [ ]:
plot_scattermap_plotly(
    incidents_df[incidents_df['n_killed']>=4],
        zoom=2,
        title='Mass shootings'
)
In [ ]:
plot_scattermap_plotly(
    incidents_df[incidents_df['children']==True],
    zoom=2,
    title='Incidents involving children'
)
In [ ]:
plot_scattermap_plotly(
    incidents_df[incidents_df['suicide']==True],
    zoom=2,
    title='Suicides'
)

The dislocation of these kind of incidents is similar to the one of the whole dataset.

Joint analysis of the datasets¶

We join the poverty data with the incidents data:

In [ ]:
poverty_df['state'] = poverty_df['state'].str.upper()
incidents_df = incidents_df.merge(poverty_df, on=['state', 'year'], how='left', validate="m:1")
incidents_df.head()
Out[ ]:
date state city_or_county address latitude longitude participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type state_house_district state_senate_district congressional_district participant1_child participant1_teen participant1_adult participant1_male participant1_female firearm air_gun shots aggression suicide injuries death road illegal_holding house school children drugs officers organized social_reasons defensive workplace abduction unintentional tag_consistency povertyPercentage px_code
0 2015-05-02 INDIANA Indianapolis Lafayette Road and Pike Plaza 39.8322 -86.2492 19.0 Adult 18+ Male 19.0 19.0 19.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 Teen wounded while walking - Security guard at nearby carnival aided victim. Shot - Wounded/Injured NaN 2015.0 2015-05-02 5 May 2 5 Saturday Marion County Indianapolis 1 1 1 1.000100e-01 road 94.0 33.0 7.0 False False True True False True False True True False True False False False False False False False False False False False False False False True 12.3 IN
1 2017-04-03 PENNSYLVANIA Kane 5647 US 6 41.6645 -78.7856 62.0 Adult 18+ Male 62.0 62.0 62.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 shot self after accident Shot - Dead (murder, accidental, suicide) Suicide^ 2017.0 2017-04-03 4 April 3 0 Monday McKean County Wetmore Township 1 -1 1 1.000100e-01 road NaN NaN 5.0 False False True True False True False True False True False True False False False False False False False False False False False False False True 10.5 PA
2 2016-11-05 MICHIGAN Detroit 6200 Block of East McNichols Road 42.4190 -83.0393 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 1.0 0.0 1.0 2.0 1 inj. Shot - Wounded/Injured NaN 2016.0 2016-11-05 11 November 5 5 Saturday Wayne County Detroit 1 1 1 1.000000e-07 place 4.0 2.0 14.0 NaN NaN NaN NaN NaN True False True True False True False False False False False False False False False False False False False False True 11.0 MI
3 2016-10-15 DISTRICT OF COLUMBIA Washington 1000 block of Bladensburg Road, NE 38.9030 -76.9820 NaN Adult 18+ Male NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 1.0 0.0 0.0 2.0 NaN Shot - Wounded/Injured NaN 2016.0 2016-10-15 10 October 15 5 Saturday Trinidad Washington 1 1 1 1.000000e-05 building NaN NaN 0.0 False False True True False True False True True False True False False False False False False False False False False False False False False True 14.9 DC
4 NaT PENNSYLVANIA Pittsburgh California and Marshall Avenues 40.4621 -80.0308 NaN Adult 18+ Male NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 1.0 0.0 1.0 2.0 NaN Shot - Wounded/Injured Drive-by (car to street, car to car) NaN 2030-06-14 6 June 14 4 Friday Allegheny County Pittsburgh 1 1 1 1.000100e-01 road NaN NaN 14.0 False False True True False True False True True False True False True False False False False False False False False False False False False True NaN NaN

We join the elections data with the incidents data:

In [ ]:
elections_df_copy = elections_df.copy()
elections_df_copy['year'] = elections_df_copy['year'] + 1
elections_df = pd.concat([elections_df, elections_df_copy], ignore_index=True)
incidents_df = incidents_df.merge(elections_df, on=['state', 'year', 'congressional_district'], how='left')
incidents_df.head()
Out[ ]:
date state city_or_county address latitude longitude participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type state_house_district state_senate_district congressional_district participant1_child participant1_teen participant1_adult participant1_male participant1_female firearm air_gun shots aggression suicide injuries death road illegal_holding house school children drugs officers organized social_reasons defensive workplace abduction unintentional tag_consistency povertyPercentage px_code party candidatevotes totalvotes candidateperc
0 2015-05-02 INDIANA Indianapolis Lafayette Road and Pike Plaza 39.8322 -86.2492 19.0 Adult 18+ Male 19.0 19.0 19.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 Teen wounded while walking - Security guard at nearby carnival aided victim. Shot - Wounded/Injured NaN 2015.0 2015-05-02 5 May 2 5 Saturday Marion County Indianapolis 1 1 1 1.000100e-01 road 94.0 33.0 7.0 False False True True False True False True True False True False False False False False False False False False False False False False False True 12.3 IN DEMOCRAT 61443.0 112261.0 54.732276
1 2017-04-03 PENNSYLVANIA Kane 5647 US 6 41.6645 -78.7856 62.0 Adult 18+ Male 62.0 62.0 62.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 shot self after accident Shot - Dead (murder, accidental, suicide) Suicide^ 2017.0 2017-04-03 4 April 3 0 Monday McKean County Wetmore Township 1 -1 1 1.000100e-01 road NaN NaN 5.0 False False True True False True False True False True False True False False False False False False False False False False False False False True 10.5 PA REPUBLICAN 206761.0 307843.0 67.164431
2 2016-11-05 MICHIGAN Detroit 6200 Block of East McNichols Road 42.4190 -83.0393 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 1.0 0.0 1.0 2.0 1 inj. Shot - Wounded/Injured NaN 2016.0 2016-11-05 11 November 5 5 Saturday Wayne County Detroit 1 1 1 1.000000e-07 place 4.0 2.0 14.0 NaN NaN NaN NaN NaN True False True True False True False False False False False False False False False False False False False False True 11.0 MI DEMOCRAT 244135.0 310974.0 78.506563
3 2016-10-15 DISTRICT OF COLUMBIA Washington 1000 block of Bladensburg Road, NE 38.9030 -76.9820 NaN Adult 18+ Male NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 1.0 0.0 0.0 2.0 NaN Shot - Wounded/Injured NaN 2016.0 2016-10-15 10 October 15 5 Saturday Trinidad Washington 1 1 1 1.000000e-05 building NaN NaN 0.0 False False True True False True False True True False True False False False False False False False False False False False False False False True 14.9 DC DEMOCRAT 265178.0 300906.0 88.126525
4 NaT PENNSYLVANIA Pittsburgh California and Marshall Avenues 40.4621 -80.0308 NaN Adult 18+ Male NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 1.0 0.0 1.0 2.0 NaN Shot - Wounded/Injured Drive-by (car to street, car to car) NaN 2030-06-14 6 June 14 4 Friday Allegheny County Pittsburgh 1 1 1 1.000100e-01 road NaN NaN 14.0 False False True True False True False True True False True False True False False False False False False False False False False False False True NaN NaN NaN NaN NaN NaN

We read and join the data about the USA population from the 2010 census downloaded from Wikipedia.

In [ ]:
usa_population_df = pd.read_csv(DATA_FOLDER_PATH + 'external_data/2010_United_States_census.csv')
In [ ]:
usa_population_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Rank                          51 non-null     object
 1   State                         52 non-null     object
 2   Population as of 2010 census  52 non-null     object
 3   Population as of 2000 census  52 non-null     object
 4   Change                        52 non-null     object
 5   Percent change                52 non-null     object
dtypes: object(6)
memory usage: 2.6+ KB
In [ ]:
usa_population_df.drop(columns=['Population as of 2000 census', 'Change', 'Percent change', 'Rank'], inplace=True)
usa_population_df.rename(columns={'Population as of 2010 census':'population_state_2010', 'State': 'state'}, inplace=True)
usa_population_df['state'] = usa_population_df['state'].str.upper()
usa_population_df['population_state_2010'] = usa_population_df['population_state_2010'].str.replace(',', '').astype('int64')
incidents_df = incidents_df.merge(usa_population_df, on=['state'], how='left')
incidents_df.sample(5, random_state=1)
Out[ ]:
date state city_or_county address latitude longitude participant_age1 participant_age_group1 participant_gender1 min_age_participants avg_age_participants max_age_participants n_participants_child n_participants_teen n_participants_adult n_males n_females n_killed n_injured n_arrested n_unharmed n_participants notes incident_characteristics1 incident_characteristics2 year date_original month month_name day day_of_week day_of_week_name county city state_consistency county_consistency address_consistency location_importance address_type state_house_district state_senate_district congressional_district participant1_child participant1_teen participant1_adult participant1_male participant1_female firearm air_gun shots aggression suicide injuries death road illegal_holding house school children drugs officers organized social_reasons defensive workplace abduction unintentional tag_consistency povertyPercentage px_code party candidatevotes totalvotes candidateperc population_state_2010
143491 2016-05-23 FLORIDA Port Charlotte Conway Boulevard 26.9829 -82.0881 18.0 Adult 18+ Male 18.0 19.0 20.0 0.0 0.0 3.0 3.0 0.0 0.0 0.0 2.0 1.0 3.0 shot at bicyclist from blue vehicle Shots Fired - No Injuries Drive-by (car to street, car to car) 2016.0 2016-05-23 5 May 23 0 Monday Charlotte County Port Charlotte 1 1 1 1.000000e-07 place 75.0 28.0 17.0 False False True True False True False True True False False False True False False False False False False False False False False False False True 13.6 FL REPUBLICAN 209348.0 338675.0 61.813833 18801310
110344 2017-02-11 PENNSYLVANIA Harrisburg 2400 block of Berryhill St 40.2645 -76.8475 24.0 Adult 18+ Male 24.0 24.0 24.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 NaN Shot - Dead (murder, accidental, suicide) Suicide^ 2017.0 2017-02-11 2 February 11 5 Saturday Dauphin County Harrisburg 1 1 1 1.000000e-07 place NaN NaN 4.0 False False True True False True False True False True False True False False False False False False False False False False False False False True 10.5 PA REPUBLICAN 220628.0 334000.0 66.056287 12702379
62395 2016-07-30 FLORIDA Jacksonville Beach 2nd Street and 3rd Avenue 30.2855 -81.3900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN NaN across from Ritz Lounge, casings found Institution/Group/Business Shots Fired - No Injuries 2016.0 2016-07-30 7 July 30 5 Saturday Duval County Jacksonville Beach 1 1 1 1.000000e-07 place 11.0 4.0 4.0 NaN NaN NaN NaN NaN True False True True False False False False False False False False False False False False False True False False True 13.6 FL REPUBLICAN 287509.0 409662.0 70.182004 18801310
159518 2015-12-19 NEBRASKA Omaha 12441 Miami St. 41.2834 -96.1073 24.0 Adult 18+ Male 24.0 24.0 24.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 NaN Shot - Dead (murder, accidental, suicide) Accidental Shooting 2015.0 2015-12-19 12 December 19 5 Saturday Douglas County Omaha 1 1 1 1.000000e-07 place NaN 6.0 2.0 False False True True False True False True False False False True False False False False False False False False False False False False True True 10.1 NE DEMOCRAT 83872.0 171509.0 48.902390 1826341
26161 NaT NORTH CAROLINA Salisbury 1505 E. Innes Street 35.6533 -80.4572 NaN Adult 18+ Male NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 1.0 2.0 Walgreens south of Interstate 85 Armed robbery with injury/death and/or evidence of DGU found Institution/Group/Business NaN 2029-06-04 6 June 4 0 Monday Rowan County Salisbury 1 1 1 1.000000e-05 shop 76.0 34.0 12.0 False False True True False True False False True False False False False True False False False False False False False False True False False False NaN NaN NaN NaN NaN NaN 9535483

We plot the number of incidents per state per 100k inhabitants:

In [ ]:
incidents_per_state = incidents_df[incidents_df['year']<=2020].groupby(['state', 'population_state_2010']).size()
incidents_per_state = ((incidents_per_state / incidents_per_state.index.get_level_values('population_state_2010'))*100000).to_frame(name='incidents_per_100k_inhabitants').sort_values(by='incidents_per_100k_inhabitants', ascending=True)
incidents_per_state.reset_index(inplace=True)
incidents_per_state.plot(
    kind='barh',
    x='state',
    y='incidents_per_100k_inhabitants',
    figsize=(15, 10),
    ylabel='State',
    xlabel='Incidents per 100k inhabitants',
    title='Incidents per 100k inhabitants per state'
)
Out[ ]:
<Axes: title={'center': 'Incidents per 100k inhabitants per state'}, xlabel='Incidents per 100k inhabitants', ylabel='State'>

District of Columbia has the highest number of incidents per 100k inhabitants.

We display the tag frequency for the District of Columbia:

In [ ]:
incidents_df[incidents_df['state']=='DISTRICT OF COLUMBIA']['incident_characteristics1'].value_counts().plot(kind='barh', figsize=(20, 10))
Out[ ]:
<Axes: >
In [ ]:
incidents_df[(incidents_df['state']=='DISTRICT OF COLUMBIA')&(incidents_df['year']<2014)].shape
Out[ ]:
(5, 75)

We visualize the number of incidents happened in each state every month:

In [ ]:
incidents_df['year'] = incidents_df['year'].astype('UInt64')
incidents_per_month_per_state = incidents_df.groupby(['state', 'month', 'year']).size()
incidents_per_month_per_state = incidents_per_month_per_state.to_frame(name='incidents').reset_index()
incidents_per_month_per_state = incidents_per_month_per_state.sort_values(by=['year', 'month', 'state'], ignore_index=True)
incidents_per_month_per_state['incidents_per_100k_inhabitants'] = incidents_per_month_per_state.apply(
    lambda row: (row['incidents'] / usa_population_df[usa_population_df['state']==row['state']]['population_state_2010'].iloc[0])*100000,
    axis=1
)
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(
    incidents_per_month_per_state[incidents_per_month_per_state.year<=2020].pivot(
        index='state',
        columns=['year', 'month'],
        values='incidents_per_100k_inhabitants'
    ).fillna(0),
    cmap='coolwarm',
    ax=ax,
    xticklabels=True,
    yticklabels=True,
    linewidths=.5
)
ax.set_xlabel('Year-Month')
ax.set_ylabel('State')
ax.set_title('Number of incidents per 100k inhabitants')

plt.xticks(rotation=90)
plt.tight_layout()
In [ ]:
incidents_per_month_per_state = incidents_df[incidents_df['incident_characteristics1']!='Non-Shooting Incident'].groupby(['state', 'month', 'year']).size()
incidents_per_month_per_state = incidents_per_month_per_state.to_frame(name='incidents').reset_index()
incidents_per_month_per_state = incidents_per_month_per_state.sort_values(by=['year', 'month', 'state'], ignore_index=True)
incidents_per_month_per_state['incidents_per_100k_inhabitants'] = incidents_per_month_per_state.apply(
    lambda row: (row['incidents'] / usa_population_df[usa_population_df['state']==row['state']]['population_state_2010'].iloc[0])*100000,
    axis=1
)
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(
    incidents_per_month_per_state[incidents_per_month_per_state.year<=2020].pivot(
        index='state',
        columns=['year', 'month'],
        values='incidents_per_100k_inhabitants'
    ).fillna(0),
    cmap='coolwarm',
    ax=ax,
    xticklabels=True,
    yticklabels=True,
    linewidths=.5
)
ax.set_xlabel('Year-Month')
ax.set_ylabel('State')
ax.set_title('Number of incidents per 100k inhabitants (excluding non-shooting incidents)')

plt.xticks(rotation=90)
plt.tight_layout()

We exclude District of Columbia:

In [ ]:
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(
    incidents_per_month_per_state[(incidents_per_month_per_state.year<=2020) & (incidents_per_month_per_state['state']!='DISTRICT OF COLUMBIA')].pivot(
        index='state',
        columns=['year', 'month'],
        values='incidents_per_100k_inhabitants'
    ).fillna(0),
    cmap='coolwarm',
    ax=ax,
    xticklabels=True,
    yticklabels=True,
    linewidths=.5
)
ax.set_xlabel('Year-Month')
ax.set_ylabel('State')
ax.set_title('Number of incidents per 100k inhabitants')


plt.xticks(rotation=90)
plt.tight_layout()

fig.savefig("../html/heatmap_incidents_months.svg")

We plot the correlations between the number of incidents and the poverty percentage in each state coloring the points according to the party that got the majority of votes in that state:

In [ ]:
winning_party_per_state = pd.read_csv(DATA_FOLDER_PATH + '../data/winning_party_per_state.csv')
winning_party_per_state_copy = winning_party_per_state.copy()
winning_party_per_state_copy['year'] = winning_party_per_state['year'] + 1
winning_party_per_state = pd.concat([winning_party_per_state, winning_party_per_state_copy], ignore_index=True)
incidents_df = incidents_df[incidents_df['year'].notna()].merge(winning_party_per_state[['state', 'year', 'majority_state_party']], on=['state', 'year'], how='left')

incidents_per_state_2016 = incidents_df[(incidents_df['n_killed']>0)].groupby(['state', 'year', 'population_state_2010', 'povertyPercentage', 'majority_state_party']).size()
incidents_per_state_2016 = incidents_per_state_2016.to_frame(name='incidents').reset_index()
incidents_per_state_2016['incidents_per_100k_inhabitants'] = (incidents_per_state_2016['incidents'] / incidents_per_state_2016['population_state_2010'])*100000
fig = px.scatter(
    incidents_per_state_2016,
    x='povertyPercentage',
    y='incidents_per_100k_inhabitants',
    color='majority_state_party',
    hover_name='state',
    hover_data={'povertyPercentage': True, 'incidents_per_100k_inhabitants': True},
    title='Mortal gun incidents in the USA',
    facet_col="year",
    facet_col_wrap=3,
    height=800
)
pyo.plot(fig, filename='../html/scatter_poverty.html', auto_open=False)
fig.show()

The two attributes are slightly correlated.

We plot the correlations between each attribute:

In [ ]:
numerical_columns = incidents_df.select_dtypes(include=['float64', 'int64']).columns
fig = plt.figure(figsize=(15, 12))
corr_matrix = incidents_df[numerical_columns].corr(method='pearson')
sns.heatmap(corr_matrix, mask=np.triu(corr_matrix));
plt.title('Correlation matrix (Pearson)');
fig.savefig("../html/attributes_correlation_matrix.svg")

The attributes min_age_participants, avg_age_participants and max_age_participants are positively correlated with each other (probably because there are many incidents involving a single person). n_participants is positively correlated with n_participants_adults and n_males: most of the incidents involve adult males. povertyPercentage is inversly correlated with latitude: southern states are poorer.

We re-order the columns and we save the cleaned dataset:

In [ ]:
time_columns = ['date', 'date_original', 'year', 'month', 'day', 'day_of_week']

geo_columns = ['state', 'address', 'latitude', 'longitude',
               'county', 'city', 'location_importance', 'address_type',
               'congressional_district', 'state_house_district', 'state_senate_district',
               'px_code']

participants_columns = ['participant_age1', 'participant1_child',
       'participant1_teen', 'participant1_adult', 'participant1_male',
       'participant1_female', 'min_age_participants', 'avg_age_participants', 'max_age_participants',
       'n_participants_child', 'n_participants_teen', 'n_participants_adult',
       'n_males', 'n_females', 'n_killed', 'n_injured', 'n_arrested',
       'n_unharmed', 'n_participants']

characteristic_columns = ['notes', 'incident_characteristics1', 'incident_characteristics2', 
    'firearm', 'air_gun', 'shots', 'aggression', 'suicide', 'injuries',
    'death', 'road', 'illegal_holding', 'house', 'school', 'children',
    'drugs', 'officers', 'organized', 'social_reasons', 'defensive',
    'workplace', 'abduction', 'unintentional']

external_columns = ['povertyPercentage', 'party', 'candidatevotes', 'totalvotes', 'candidateperc', 'population_state_2010']

incidents_df = incidents_df[time_columns + geo_columns + participants_columns + characteristic_columns + external_columns]
incidents_df = incidents_df.rename(
    columns={
        'povertyPercentage': 'poverty_perc',
        'candidatevotes': 'candidate_votes',
        'totalvotes': 'total_votes',
        'candidateperc': 'candidate_perc'
    }
)

incidents_df.to_csv(DATA_FOLDER_PATH +'incidents_cleaned.csv', index=True)